# GBIF verification of occurrence names
This notebook is run after the models and takes as input the list of potential species names they produce.

## Overview
We want to check whether the names found are actual species names, and retrieve a dictionary of species name / GBIF link pairs along with a list of rejected species names.

# I - Installation of pygbif (Python for GBIF)

This Python library makes it easy to run the queries we need against the GBIF backbone taxonomy.

In [1]:
pip install pygbif

Defaulting to user installation because normal site-packages is not writeable
Collecting pygbif
  Downloading pygbif-0.6.6-py3-none-any.whl.metadata (13 kB)
Collecting requests-cache (from pygbif)
  Downloading requests_cache-1.2.1-py3-none-any.whl.metadata (9.9 kB)
Collecting geojson_rewind (from pygbif)
  Downloading geojson_rewind-1.2.1-py3-none-any.whl.metadata (4.5 kB)
Collecting geomet (from pygbif)
  Downloading geomet-1.1.0-py3-none-any.whl.metadata (11 kB)
Collecting appdirs>=1.4.3 (from pygbif)
  Downloading appdirs-1.4.4-py2.py3-none-any.whl.metadata (9.0 kB)
Collecting click (from geomet->pygbif)
  Downloading click-8.3.1-py3-none-any.whl.metadata (2.6 kB)
Collecting cattrs>=22.2 (from requests-cache->pygbif)
  Downloading cattrs-25.3.0-py3-none-any.whl.metadata (8.4 kB)
Collecting url-normalize>=1.4 (from requests-cache->pygbif)
  Downloading url_normalize-2.2.1-py3-none-any.whl.metadata (5.6 kB)
Collecting attrs>=21.2 (from requests-cache->pygbif)
  Downloading attrs-25.4




# II - Basic functions provided

In [2]:
from pygbif import species

Let's take Canis lupus as an example: the gray wolf, whose full scientific name is Canis lupus Linnaeus, 1758.

In [12]:
name1 = "Canis lupus"
name2 = "Canus lupus"
name3 = "Canos lps"

The name_backbone function returns a lot of useful information from a name passed as a plain string.

In [7]:
res = species.name_backbone(scientificName=name1)
print(res)

{'usage': {'key': '5219173', 'name': 'Canis lupus Linnaeus, 1758', 'canonicalName': 'Canis lupus', 'authorship': 'Linnaeus, 1758', 'rank': 'SPECIES', 'code': 'ZOOLOGICAL', 'status': 'ACCEPTED', 'genericName': 'Canis', 'specificEpithet': 'lupus', 'type': 'SCIENTIFIC', 'formattedName': '<i>Canis</i> <i>lupus</i> Linnaeus, 1758'}, 'classification': [{'key': '1', 'name': 'Animalia', 'rank': 'KINGDOM'}, {'key': '44', 'name': 'Chordata', 'rank': 'PHYLUM'}, {'key': '359', 'name': 'Mammalia', 'rank': 'CLASS'}, {'key': '732', 'name': 'Carnivora', 'rank': 'ORDER'}, {'key': '9701', 'name': 'Canidae', 'rank': 'FAMILY'}, {'key': '5219142', 'name': 'Canis', 'rank': 'GENUS'}, {'key': '5219173', 'name': 'Canis lupus', 'rank': 'SPECIES'}], 'diagnostics': {'matchType': 'EXACT', 'confidence': 99, 'timeTaken': 2, 'timings': {'nameNRank': 0, 'sciNameMatch': 3, 'nameParse': 0, 'luceneMatch': 3}}, 'additionalStatus': [{'clbDatasetKey': '53131', 'datasetAlias': 'IUCN', 'datasetKey': '19491596-35ae-4a91-9a98-8

When the name is misspelled:

In [10]:
res_false1 = species.name_backbone(scientificName=name2)
print(res_false1)

{'usage': {'key': '5219173', 'name': 'Canis lupus Linnaeus, 1758', 'canonicalName': 'Canis lupus', 'authorship': 'Linnaeus, 1758', 'rank': 'SPECIES', 'code': 'ZOOLOGICAL', 'status': 'ACCEPTED', 'genericName': 'Canis', 'specificEpithet': 'lupus', 'type': 'SCIENTIFIC', 'formattedName': '<i>Canis</i> <i>lupus</i> Linnaeus, 1758'}, 'classification': [{'key': '1', 'name': 'Animalia', 'rank': 'KINGDOM'}, {'key': '44', 'name': 'Chordata', 'rank': 'PHYLUM'}, {'key': '359', 'name': 'Mammalia', 'rank': 'CLASS'}, {'key': '732', 'name': 'Carnivora', 'rank': 'ORDER'}, {'key': '9701', 'name': 'Canidae', 'rank': 'FAMILY'}, {'key': '5219142', 'name': 'Canis', 'rank': 'GENUS'}, {'key': '5219173', 'name': 'Canis lupus', 'rank': 'SPECIES'}], 'diagnostics': {'matchType': 'VARIANT', 'confidence': 85, 'timeTaken': 2, 'timings': {'nameNRank': 0, 'sciNameMatch': 3, 'nameParse': 0, 'luceneMatch': 3}}, 'additionalStatus': [{'clbDatasetKey': '53131', 'datasetAlias': 'IUCN', 'datasetKey': '19491596-35ae-4a91-9a98

In [13]:
res_false2 = species.name_backbone(scientificName=name3)
print(res_false2)

{'diagnostics': {'matchType': 'NONE', 'issues': [], 'confidence': 100, 'timeTaken': 1, 'timings': {'sciNameMatch': 2}}, 'synonym': False}


In [6]:
res_false3 = species.name_backbone(scientificName="Canis Lupus")
res_false4 = species.name_backbone(scientificName="canis lupus")

print(res_false3)
print(res_false4)

{'diagnostics': {'matchType': 'NONE', 'issues': [], 'confidence': 100, 'note': 'No match because of too little confidence', 'timeTaken': 31, 'timings': {'sciNameMatch': 32}}, 'synonym': False}
{'usage': {'key': '5219173', 'name': 'Canis lupus Linnaeus, 1758', 'canonicalName': 'Canis lupus', 'authorship': 'Linnaeus, 1758', 'rank': 'SPECIES', 'code': 'ZOOLOGICAL', 'status': 'ACCEPTED', 'genericName': 'Canis', 'specificEpithet': 'lupus', 'type': 'SCIENTIFIC', 'formattedName': '<i>Canis</i> <i>lupus</i> Linnaeus, 1758'}, 'classification': [{'key': '1', 'name': 'Animalia', 'rank': 'KINGDOM'}, {'key': '44', 'name': 'Chordata', 'rank': 'PHYLUM'}, {'key': '359', 'name': 'Mammalia', 'rank': 'CLASS'}, {'key': '732', 'name': 'Carnivora', 'rank': 'ORDER'}, {'key': '9701', 'name': 'Canidae', 'rank': 'FAMILY'}, {'key': '5219142', 'name': 'Canis', 'rank': 'GENUS'}, {'key': '5219173', 'name': 'Canis lupus', 'rank': 'SPECIES'}], 'diagnostics': {'matchType': 'EXACT', 'confidence': 99, 'timeTaken': 2, 't

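These results show three matchType values: EXACT, VARIANT, and NONE. We can use them to define how much misspelling we accept. Capitalization also matters: "canis lupus" still matches exactly, whereas "Canis Lupus" returns no match.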
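As a minimal sketch of that acceptance rule, the helper below maps a response to one of three buckets. The function name `classify_match` and the bucket labels are hypothetical, and the sample dictionaries are trimmed-down copies of the responses printed above, so the sketch runs without calling GBIF. Any matchType other than EXACT or VARIANT is treated as rejected, which is an assumption rather than a rule taken from the GBIF documentation.

```python
def classify_match(response):
    """Map a name_backbone-style response dict to 'exact', 'variant' or 'rejected'."""
    # A response without diagnostics is treated like an explicit NONE.
    match_type = response.get("diagnostics", {}).get("matchType", "NONE")
    if match_type == "EXACT":
        return "exact"
    if match_type == "VARIANT":
        return "variant"
    # Assumption: every other match type is rejected.
    return "rejected"


# Trimmed-down copies of the responses shown above (no network call needed).
exact = {"usage": {"key": "5219173"}, "diagnostics": {"matchType": "EXACT", "confidence": 99}}
variant = {"usage": {"key": "5219173"}, "diagnostics": {"matchType": "VARIANT", "confidence": 85}}
no_match = {"diagnostics": {"matchType": "NONE", "confidence": 100}, "synonym": False}

for sample in (exact, variant, no_match):
    print(classify_match(sample))
```

Keeping the decision in one small function makes it easy to tighten later, for instance by also checking the confidence value.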

Next, we build the link to the species page from the response.

The usage entry of the response contains a key that can be used to link to the species page.

In [16]:
print(res.get("usage"))
key = res["usage"]["key"]
print(key)

print(f"GBIF taxonomy page: https://www.gbif.org/species/{key}")

{'key': '5219173', 'name': 'Canis lupus Linnaeus, 1758', 'canonicalName': 'Canis lupus', 'authorship': 'Linnaeus, 1758', 'rank': 'SPECIES', 'code': 'ZOOLOGICAL', 'status': 'ACCEPTED', 'genericName': 'Canis', 'specificEpithet': 'lupus', 'type': 'SCIENTIFIC', 'formattedName': '<i>Canis</i> <i>lupus</i> Linnaeus, 1758'}
5219173
GBIF taxonomy page: https://www.gbif.org/species/5219173
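Responses with matchType NONE have no usage entry, so indexing `res["usage"]["key"]` directly would raise a KeyError for them. A small guard, sketched below under the assumption that the response is a plain dictionary shaped like the ones printed above, avoids that. The names `gbif_link` and `GBIF_SPECIES_URL` are hypothetical.

```python
GBIF_SPECIES_URL = "https://www.gbif.org/species/{key}"


def gbif_link(response):
    """Return the GBIF species page URL for a response, or None if there is no match."""
    usage = response.get("usage")
    # Unmatched names come back without a usage entry (see the NONE output above).
    if not usage or "key" not in usage:
        return None
    return GBIF_SPECIES_URL.format(key=usage["key"])


matched = {"usage": {"key": "5219173", "canonicalName": "Canis lupus"}}
unmatched = {"diagnostics": {"matchType": "NONE"}, "synonym": False}

print(gbif_link(matched))
print(gbif_link(unmatched))
```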


# III - Verification function

In [None]:
def verif(names_list):
    # One dictionary for the correctly spelled species names with their
    # GBIF link, and two lists: one for variant spellings and one for
    # rejected names.
    correct_occurrences = {}
    misspelled_occurrences = []
    wrong_occurrences = []

    # Iterate over the whole list of names and use the match type of each
    # lookup to decide which collection the name goes into.
    for name in names_list:
        res = species.name_backbone(scientificName=name)
        match_type = res["diagnostics"]["matchType"]
        if match_type == "EXACT":
            usage = res["usage"]
            # Single quotes inside the f-string keep this valid before Python 3.12.
            correct_occurrences[usage["canonicalName"]] = f"GBIF taxonomy page: https://www.gbif.org/species/{usage['key']}"
        elif match_type == "VARIANT":
            usage = res["usage"]
            # Keep the name found by the models next to the name GBIF matched it to.
            misspelled_occurrences.append([name, usage["canonicalName"], f"GBIF taxonomy page: https://www.gbif.org/species/{usage['key']}"])
        else:
            wrong_occurrences.append(name)

    return correct_occurrences, misspelled_occurrences, wrong_occurrences
    

The function returns three things: a dictionary of correctly identified names with their links; a list of names that GBIF matched only approximately, each stored with the name produced by the models and the actual species name so the two can be compared later; and a list of rejected names.

Manual test

In [34]:
test = ["Canis lupus", "Equus ferus", "Equus Caballus", "Equus caballus", "Canus lupos", "Canus lps"]

In [35]:
a, b, c = verif(test)

In [36]:
print(a)
print(b)
print(c)

{'Canis lupus': 'GBIF taxonomy page: https://www.gbif.org/species/5219173', 'Equus ferus': 'GBIF taxonomy page: https://www.gbif.org/species/4409270', 'Equus caballus': 'GBIF taxonomy page: https://www.gbif.org/species/2440886'}
[['Canus lupos', 'Canis lupus', 'GBIF taxonomy page: https://www.gbif.org/species/5219173']]
['Equus Caballus', 'Canus lps']


We can already see that capitalization is a problem: a correct name with the wrong capitalization ("Equus Caballus") is rejected as unmatched.
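One possible workaround, sketched here as an assumption rather than something this notebook has validated against GBIF, is to normalize the capitalization before the lookup: capitalize the genus and lowercase every following word. The helper name `normalize_binomial` is hypothetical. It should only be applied to bare names, since it would also lowercase an authorship such as "Linnaeus, 1758".

```python
def normalize_binomial(name):
    """Capitalize the first word (genus) and lowercase the remaining words."""
    parts = name.split()  # also collapses repeated or surrounding whitespace
    if not parts:
        return ""
    genus = parts[0].capitalize()
    rest = [part.lower() for part in parts[1:]]
    return " ".join([genus] + rest)


for raw in ["Equus Caballus", "canis lupus", "  CANIS   LUPUS "]:
    print(repr(raw), "->", repr(normalize_binomial(raw)))
```

The normalized string could then be passed to verif in place of the raw name, keeping the raw name alongside it if the original spelling needs to be reported.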