### Introduction
This notebook was used to analyze the data in the GBIF Occurrence dataset, extracting and filtering relevant information. Additionally, scientific names were mapped using the GBIF API, complemented by manual adjustments where necessary.

### Analyze Dataset Structure

In [None]:
import pandas as pd

gbif_occurences_path = r"D:\OneDrive\Universität\Schrank\Master\2. Semester\Data Integration\data_integration\analysis\data_sources\gbif_occurences.csv"

# Read the csv file
gbif_occurences = pd.read_csv(gbif_occurences_path, delimiter='\t')

# Drop columns where all values are NaN
gbif_occurences = gbif_occurences.dropna(axis=1, how='all')


In [23]:
# Get all the columns of the dataframe
print(gbif_occurences.columns)

# Print out family, genus, species, and scientificName for the first 10 rows
print("family, genus, species, and scientificName")
print(gbif_occurences[['family', 'genus', 'species', 'scientificName']].head(10))

# Check whether the number of unique taxon keys equals the number of unique scientific names
print(gbif_occurences['taxonKey'].nunique() == gbif_occurences['scientificName'].nunique())

# Print out all columns and check whether their values are relevant for the analysis
for column in gbif_occurences.columns:
    print(column)
    print(gbif_occurences[column].head())
    print(column + " unique values")
    print(gbif_occurences[column].unique())

Index(['gbifID', 'datasetKey', 'occurrenceID', 'kingdom', 'phylum', 'class',
       'order', 'family', 'genus', 'species', 'infraspecificEpithet',
       'taxonRank', 'scientificName', 'verbatimScientificName', 'countryCode',
       'locality', 'occurrenceStatus', 'publishingOrgKey', 'decimalLatitude',
       'decimalLongitude', 'coordinateUncertaintyInMeters', 'elevation',
       'elevationAccuracy', 'depth', 'depthAccuracy', 'eventDate', 'day',
       'month', 'year', 'taxonKey', 'speciesKey', 'basisOfRecord',
       'institutionCode', 'collectionCode', 'catalogNumber', 'identifiedBy',
       'dateIdentified', 'license', 'rightsHolder', 'recordedBy',
       'lastInterpreted', 'issue'],
      dtype='object')
family, genus, species, and scientificName
          family        genus               species  \
0  Holocentridae  Myripristis   Myripristis jacobus   
1     Serranidae   Diplectrum   Diplectrum formosum   
2    Cheloniidae      Caretta       Caretta caretta   
3   Aulostomidae  

There are a lot of columns here. Which attributes could be relevant for a scuba diving recommender system, and which columns can probably just be dropped?

Relevant columns:
- **decimalLatitude** and **decimalLongitude**
- **taxonKey**: A taxonKey in the Global Biodiversity Information Facility (GBIF) uniquely identifies a species, but only within the GBIF system.
- **species**: This is the scientific name of the species (the column scientificName additionally includes the author and the year of the first description). We will map this to a common name of the species, e.g. Diplectrum formosum -> Sägebarsch (row 1).
- **eventDate**, **dateIdentified**: Seasonality?
- **depth**
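
The assumption that a taxonKey identifies exactly one species can be checked against the data. The following is a minimal sketch on a small made-up sample; the helper name `is_one_to_one` is hypothetical, and in the notebook the check would run on `gbif_occurences` instead of `sample`.

```python
import pandas as pd

# Made-up sample rows; in the notebook this would be gbif_occurences
sample = pd.DataFrame({
    "taxonKey": [2357064, 2357064, 5210229, 8894817],
    "scientificName": [
        "Myripristis jacobus Cuvier, 1829",
        "Myripristis jacobus Cuvier, 1829",
        "Diplectrum formosum (Linnaeus, 1766)",
        "Caretta caretta (Linnaeus, 1758)",
    ],
})

def is_one_to_one(df, key_col, name_col):
    """Return True if every key maps to exactly one name and vice versa."""
    names_per_key = df.groupby(key_col)[name_col].nunique()
    keys_per_name = df.groupby(name_col)[key_col].nunique()
    return bool((names_per_key == 1).all() and (keys_per_name == 1).all())

one_to_one = is_one_to_one(sample, "taxonKey", "scientificName")
print(one_to_one)
```

Unlike a comparison of the two unique counts, this check also catches the case where the counts happen to match although individual keys map to several names.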



Maybe relevant:
- countryCode, locality, stateProvince
- **kingdom**: the values are ['Animalia' 'incertae sedis' 'Plantae' 'Chromista'], possibly relevant
- depthAccuracy: potential error of the depth value
- elevation
- speciesKey: possibly an identifier?

Irrelevant columns:
- gbifID, datasetKey, occurrenceID
- phylum 
- order 
- genus, scientificName, family (redundant)
- infraspecificEpithet
- day, month, year (redundant to eventDate)
- occurrenceStatus (only PRESENT)
- elevationAccuracy (only 0)
- publishingOrgKey
- taxonRank: Technical taxonomic data, which is more useful for researchers than recreational divers.
- coordinateUncertaintyInMeters (is always 100)
- basisOfRecord: it is always human observation...
- institutionCode, collectionCode, catalogNumber
- identifiedBy
- license, rightsHolder, recordedBy: not relevant for divers
- lastInterpreted: administrative data
- issue: administrative column
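
Several of the columns above are dropped because they hold a single value only. Such columns can also be found programmatically. The following is a minimal sketch; the helper `constant_columns` and the sample values are made up for illustration.

```python
import pandas as pd

def constant_columns(df):
    """Return {column: value} for columns with exactly one distinct non-null value."""
    result = {}
    for col in df.columns:
        values = df[col].dropna().unique()
        if len(values) == 1:
            result[col] = values[0]
    return result

# Made-up sample; in the notebook this would be gbif_occurences
sample = pd.DataFrame({
    "occurrenceStatus": ["PRESENT", "PRESENT", "PRESENT"],
    "coordinateUncertaintyInMeters": [100.0, 100.0, 100.0],
    "depth": [5.0, 12.0, 30.0],
})

constants = constant_columns(sample)
print(constants)
```

Columns reported here carry no information that distinguishes one occurrence from another, which supports dropping them.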

### Add Common Names using API

We want to add common (vernacular) species names using the following API. As a usageKey we can use the taxonKey found in our data source.

https://techdocs.gbif.org/en/openapi/v1/species#/Species/getNameUsageVernacularNames
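
The endpoint returns names in many languages, so a selection step is needed to end up with one name per species. The following is a minimal sketch of such a step. It assumes that each entry in `results` carries a `language` field with codes such as `eng` or `deu` next to `vernacularName`; the helper `pick_common_name` and the sample payload are made up for illustration and are not a real API response.

```python
from collections import Counter

def pick_common_name(results, preferred_languages=("eng", "deu")):
    """Return the most frequent name in the first preferred language that has any."""
    for language in preferred_languages:
        names = [
            entry["vernacularName"]
            for entry in results
            if entry.get("language") == language and entry.get("vernacularName")
        ]
        if names:
            # Most frequent name wins; ties resolve to the name seen first
            return Counter(names).most_common(1)[0][0]
    return None

# Made-up payload shaped like the 'results' list (assumption, see above)
sample_results = [
    {"vernacularName": "Trompetenfisch", "language": "deu"},
    {"vernacularName": "Atlantic trumpetfish", "language": "eng"},
    {"vernacularName": "Atlantic trumpetfish", "language": "eng"},
    {"vernacularName": "Trumpetfish", "language": "eng"},
]

best = pick_common_name(sample_results)
print(best)
```

The function returns `None` when no entry matches a preferred language, so the caller can decide how to handle species without a usable name.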

In [None]:

import requests
import pandas as pd
from collections import Counter
import time

gbif_occurences_path = r"data_sources/gbif_occurences.csv"

# Read the csv file
gbif_occurences = pd.read_csv(gbif_occurences_path, delimiter='\t')

# Drop columns where all values are NaN
gbif_occurences = gbif_occurences.dropna(axis=1, how='all')

# Create a new dataframe with only the unique species:
# api_species_mapping contains the columns scientificName and taxonKey from gbif_occurences
api_species_mapping = gbif_occurences[['scientificName', 'taxonKey']].drop_duplicates()

# Save the output to a new CSV file
output_path = r"cleaned_data/intermediate steps/species_taxonKeys.csv"
api_species_mapping.to_csv(output_path, index=False)


# Log the start of the process
print("🟢 Name Mapping started")
total_species = len(api_species_mapping)
print(f"🔍 Searching for {total_species} Scientific Names")

def get_vernacular_names(usage_key):
    """
    Fetch the vernacular (common) names for a given usage key using
    /species/{usageKey}/vernacularNames.
    Returns a list of names in all languages; the list is empty if nothing was found.
    """
    if pd.isna(usage_key):
        print("   ⚠️ No usage key provided.")
        return []
    
    print(f"   🔄 Fetching common names for usage key: {usage_key}...")
    vernacular_url = f"https://api.gbif.org/v1/species/{usage_key}/vernacularNames"
    response = requests.get(vernacular_url, timeout=30)
    
    if response.status_code == 200:
        names = response.json().get('results', [])
        
        # Collect the names only (the language is not kept)
        vernacular_names = [name['vernacularName'] for name in names]
        
        return vernacular_names
    else:
        print(f"   ⚠️ Request for common names failed with status code: {response.status_code}")
    
    # Return an empty list so that the caller always receives a list
    return []

# Ensure no duplicate entries and reset index
api_species_mapping.drop_duplicates(subset=['taxonKey'], inplace=True)
api_species_mapping.reset_index(drop=True, inplace=True)

# Fetch common names for each species with progress tracking
start_time = time.time()
for index, row in api_species_mapping.iterrows():
    species_name = row.get('scientificName')
    species_key = row.get('taxonKey')
    
    if pd.isna(species_key):
        print(f"⚠️ Missing taxonKey for '{species_name}', skipping...")
        continue
    
    print(f"🔍 Processing ({index + 1}/{total_species}): {species_name}")
    
    # Fetch possible vernacular names using the existing taxonKey
    vernacular_names = get_vernacular_names(species_key)
    
    # Store the list as a string (e.g. "['a', 'b']") so that it can be parsed later with ast.literal_eval
    api_species_mapping.at[index, 'Vernacular Names List'] = str(vernacular_names)

# Save the output to a new CSV file
output_path = r"cleaned_data/intermediate steps/species_taxonKeys_with_possible_names.csv"
api_species_mapping.to_csv(output_path, index=False)

# Display completion message with timing information
end_time = time.time()
elapsed_time = end_time - start_time
print(f"🟢 Name Mapping completed in {elapsed_time:.2f} seconds")
print(f"💾 Results saved to '{output_path}'")

# Display the first few rows of the result
print(api_species_mapping.head())


🟢 Name Mapping completed in 554.67 seconds
💾 Results saved to 'cleaned_data/intermediate steps/species_taxonKeys_with_possible_names.csv'
                            scientificName  taxonKey  \
0         Myripristis jacobus Cuvier, 1829   2357064   
1     Diplectrum formosum (Linnaeus, 1766)   5210229   
2         Caretta caretta (Linnaeus, 1758)   8894817   
3  Aulostomus maculatus Valenciennes, 1841   2332595   
4       Sargocentron coruscum (Poey, 1860)   2356835   

                               Vernacular Names List  
0  ['Baga-baga', 'Bagsang', 'Bartolito', 'Bastard...  
1  ['Arenero', 'Bolo', 'Canguito', 'Jacundá', 'Ja...  
2  ['Avó-de-Aruanã', 'Avó-de-Aruanã', 'Cabeça-Gra...  
3  ['Atlantic trumpetfish', 'Atlantic trumpetfish...  
4  ['Candil rayado', 'Candil rayado', 'Carajuelo'...  


### Use ChatGPT

We use ChatGPT to find an English vernacular name for each species.
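
The stored name lists contain duplicates, and a list cell may not always be parseable. The following is a minimal sketch of a parsing step that could run before the prompt is built. The helper names `parse_name_list` and `build_prompt` are hypothetical, and "Loggerhead sea turtle" is an illustrative value, not taken from the data.

```python
import ast

def parse_name_list(cell):
    """Parse a stored list string such as "['a', 'b']" and drop duplicates, keeping order."""
    try:
        names = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        # Not a valid Python literal (e.g. a placeholder text or NaN)
        return []
    if not isinstance(names, list):
        return []
    return list(dict.fromkeys(str(name) for name in names))

def build_prompt(scientific_name, names):
    """Build the prompt text from the scientific name and the known vernacular names."""
    known = ", ".join(names) if names else "none known"
    return (
        "Provide only the most common English name of the following species. "
        f"The scientific name is {scientific_name} and known vernacular names are: {known}."
    )

names = parse_name_list("['Avó-de-Aruanã', 'Avó-de-Aruanã', 'Loggerhead sea turtle']")
prompt = build_prompt("Caretta caretta (Linnaeus, 1758)", names)
print(prompt)
```

Removing duplicates shortens the prompt, and the empty-list fallback avoids an exception in the middle of a long run.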

In [None]:


import pandas as pd
import numpy as np
from openai import OpenAI
import ast
import os

api_species_mapping_path = r"cleaned_data/intermediate steps/species_taxonKeys_with_possible_names.csv"
#gbif_occurences_path = r"data_sources/gbif_occurences.csv"

# Read the csv file
gpt_species_mapping = pd.read_csv(api_species_mapping_path, delimiter=',')

# Create a Dataframe with only the unique species and a column for the common name
# species_mapping = gbif_occurences['species'].drop_duplicates()
# species_mapping = species_mapping.to_frame()
gpt_species_mapping['commonName'] = np.nan
gpt_species_mapping['commonName'] = gpt_species_mapping['commonName'].astype(object)  # Ensure dtype compatibility


# Required: Save your OpenAI API key as an environment variable in powershell
# setx OPENAI_API_KEY "your_api_key_here"

# Ensure the API key is set
api_key = os.getenv("OPENAI_API_KEY")
if api_key is None:
    raise ValueError("OPENAI_API_KEY environment variable is not set")
else:
    print("API key is set")

client = OpenAI()

for i in range(len(gpt_species_mapping)):

    # array with all species names:
    scientificName = gpt_species_mapping.iloc[i]['scientificName']
    vernacularNames = ast.literal_eval(gpt_species_mapping.iloc[i]['Vernacular Names List'])
    vernacularNames = ', '.join(vernacularNames)
    
    print(f"Processing species {i+1}/{len(gpt_species_mapping)}")
    print(f"Species name: {scientificName}")

    prompt = f"Provide only the most common english name of the following species! The scientific Name of the Species is {scientificName} and different vernacular names of the species are {vernacularNames}. Provide nothing else besides the most common english name."

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    result = completion.choices[0].message.content

    #result = "🐟"

    print(f"Common name for {scientificName}: {result}")

    gpt_species_mapping.loc[i, 'commonName'] = result
gpt_species_mapping.drop(columns=['Vernacular Names List'], inplace=True)
gpt_species_mapping.drop(columns=['taxonKey'], inplace=True)


# Save the species Mapper to a new CSV file
gpt_species_mapping.to_csv(r"cleaned_data/gpt_species_mapping.csv", index=False)


# Reduce the dataframe to the columns decimalLatitude, decimalLongitude, scientificName, eventDate, dateIdentified, depth
# (gbif_occurences must still be loaded from the previous cell; .copy() avoids a SettingWithCopyWarning)
reduced_gbif_occurences = gbif_occurences[['decimalLatitude', 'decimalLongitude', 'scientificName', 'eventDate', 'dateIdentified', 'depth']].copy()

# Add the common name to the reduced dataframe: look up the scientificName of each occurrence in the gpt_species_mapping dataframe
reduced_gbif_occurences['commonName'] = reduced_gbif_occurences['scientificName'].map(gpt_species_mapping.set_index('scientificName')['commonName'])

# Count the number of missing common names
missing_common_names = reduced_gbif_occurences['commonName'].isnull().sum()
print(f"🔍 Found {missing_common_names} missing common names.")


# Export the reduced dataframe to a new csv file
reduced_gbif_occurences.to_csv(r"cleaned_data/gbif_occurences_cleaned.csv", index=False)





🔍 Found 0 missing common names.


## Old Code (do not run)

### Add Common Names to GBIF Dataset

We reduce the GBIF dataset to relevant columns we have found before. Then we add the common names as an additional column to the dataset.
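
The reduce-and-map step can be sketched on a small made-up sample as follows. The common names are illustrative values; taking an explicit `.copy()` of the column subset is one way to avoid a pandas `SettingWithCopyWarning` when the new column is assigned.

```python
import pandas as pd

# Made-up occurrences and mapping for illustration
occurrences = pd.DataFrame({
    "scientificName": [
        "Caretta caretta (Linnaeus, 1758)",
        "Myripristis jacobus Cuvier, 1829",
        "Octopus Cuvier, 1798",
    ],
    "depth": [5.0, 12.0, 30.0],
    "license": ["CC_BY_4_0", "CC_BY_4_0", "CC_BY_4_0"],
})
mapping = pd.DataFrame({
    "scientificName": [
        "Caretta caretta (Linnaeus, 1758)",
        "Myripristis jacobus Cuvier, 1829",
    ],
    "commonName": ["Loggerhead sea turtle", "Blackbar soldierfish"],
})

# Keep only the relevant columns; .copy() makes the result an independent DataFrame
reduced = occurrences[["scientificName", "depth"]].copy()

# Look up the common name for each occurrence via its scientificName
name_lookup = mapping.set_index("scientificName")["commonName"]
reduced["commonName"] = reduced["scientificName"].map(name_lookup)

missing = int(reduced["commonName"].isna().sum())
print(f"Found {missing} missing common names.")
```

Species without an entry in the mapping end up as NaN, which is what the missing-name count relies on.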

In [54]:
# Reduce the dataframe to the columns decimalLatitude, decimalLongitude, species, eventDate, dateIdentified, depth
reduced_gbif_occurences = gbif_occurences[['decimalLatitude', 'decimalLongitude', 'scientificName', 'eventDate', 'dateIdentified', 'depth']]

# Add the common name to the reduced dataframe
reduced_gbif_occurences['commonName'] = reduced_gbif_occurences['scientificName'].map(species_mapping.set_index('scientificName')['Common Name'])

# Count the number of missing common names
missing_common_names = reduced_gbif_occurences['commonName'].isnull().sum()
print(f"🔍 Found {missing_common_names} missing common names.")

# Export the reduced dataframe to a new csv file
reduced_gbif_occurences.to_csv(r"D:\OneDrive\Universität\Schrank\Master\2. Semester\Data Integration\data_integration\analysis\cleaned_data\gbif_occurences_cleaned.csv", index=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  reduced_gbif_occurences['commonName'] = reduced_gbif_occurences['scientificName'].map(species_mapping.set_index('scientificName')['Common Name'])


🔍 Found 0 missing common names.


Notes

### Add Common Names manually

Because 574 species have no common name yet, the names need to be filled in manually.

In [None]:
# Filter out missing rows

missing_entries = species_mapping['scientificName'][species_mapping['Common Name'] == 'No common name found']

print(f"🔍 Found {len(missing_entries)} missing entries:"
      f"\n{missing_entries}")

# Save the missing entries to a new CSV file

missing_entries_path = r"D:\OneDrive\Universität\Schrank\Master\2. Semester\Data Integration\data_integration\analysis\cleaned_data\intermediate_steps\missing_entries.csv"

missing_entries.to_csv(missing_entries_path, index=False)

print(f"💾 Missing entries saved to '{missing_entries_path}'"
      f"\n🚨 Please review and update the missing entries manually.")



🔍 Found 574 missing entries:
22                Carangoides bartholomaei (Cuvier, 1833)
34      Chaetonotus napoleonicus Balsamo, Todaro & Ton...
52                 Scyliorhinus canicula (Linnaeus, 1758)
75                                   Octopus Cuvier, 1798
91                     Pareledone charcoti (Joubin, 1905)
                              ...                        
4077                 Nereiphylla paretti Blainville, 1828
4079                     Pagurus liochele (Barnard, 1947)
4085       Synapta maculata (Chamisso & Eysenhardt, 1821)
4086          Balanophyllia bonaespei van der Horst, 1938
4092                  Iphimedia gibba (K.H.Barnard, 1955)
Name: scientificName, Length: 574, dtype: object
💾 Missing entries saved to 'D:\OneDrive\Universität\Schrank\Master\2. Semester\Data Integration\data_integration\analysis\cleaned_data\missing_entries.csv'
🚨 Please review and update the missing entries manually.


### Summarize species mapping

We combine the manually filled species names with the species names found using the API.
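
The combination can also be written without a row-by-row loop. The following is a minimal sketch on made-up sample rows, assuming that each scientificName appears only once in the manually filled file (otherwise `map` raises an error on the non-unique index).

```python
import pandas as pd

PLACEHOLDER = "No common name found"

# Made-up sample; in the notebook these would be species_mapping and filled_entries
species_mapping = pd.DataFrame({
    "scientificName": [
        "Carangoides bartholomaei (Cuvier, 1833)",
        "Caretta caretta (Linnaeus, 1758)",
        "Octopus Cuvier, 1798",
    ],
    "Common Name": [PLACEHOLDER, "Loggerhead sea turtle", PLACEHOLDER],
})
filled_entries = pd.DataFrame({
    "scientificName": ["Carangoides bartholomaei (Cuvier, 1833)"],
    "Common Name": ["Yellow Jack"],
})

# Look up the manual name for every row; rows without a manual entry become NaN
manual = filled_entries.set_index("scientificName")["Common Name"]
manual_for_rows = species_mapping["scientificName"].map(manual)

# Use the manual name where one exists, otherwise keep the existing value
species_mapping["Common Name"] = manual_for_rows.fillna(species_mapping["Common Name"])

still_missing = int((species_mapping["Common Name"] == PLACEHOLDER).sum())
print(f"Found {still_missing} missing entries after filling.")
```

As in the loop version, a manual entry takes precedence over a name that is already present.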

In [None]:
# Load the filled missing entries CSV file

filled_entries_path = r"D:/OneDrive/Universität/Schrank/Master/2. Semester/Data Integration/data_integration/analysis/cleaned_data/intermediate_steps/filled_entries.csv"
filled_entries = pd.read_csv(filled_entries_path, sep=';')


# Display the first few rows of the filled entries
print(filled_entries.head())


# Add the filled entries to the original DataFrame
for index, row in filled_entries.iterrows():
    scientific_name = row.get('scientificName')
    common_name = row.get('Common Name')

    # Find the matching row in the original DataFrame
    match = species_mapping['scientificName'] == scientific_name
    species_mapping.loc[match, 'Common Name'] = common_name

# Count the number of missing entries after filling
entries_without_name = species_mapping['Common Name'] == 'No common name found'
print(f"🔍 Found {entries_without_name.sum()} missing entries after filling.")

# Save the updated DataFrame to a new CSV file
output_path = r"D:/OneDrive/Universität/Schrank/Master/2. Semester/Data Integration/data_integration/analysis/cleaned_data/intermediate_steps/species_taxonKeys_with_common_names_filled.csv"
species_mapping.to_csv(output_path, index=False)



                                      scientificName             Common Name
0            Carangoides bartholomaei (Cuvier, 1833)             Yellow Jack
1  Chaetonotus napoleonicus Balsamo, Todaro & Ton...           Napoleon Fish
2             Scyliorhinus canicula (Linnaeus, 1758)  Small-Spotted Catshark
3                               Octopus Cuvier, 1798          Common Octopus
4                 Pareledone charcoti (Joubin, 1905)       Charcot's Octopus
🔍 Found 0 missing entries after filling.


Example scientific name: Scorpaena elongata Cadenat, 1943

**Catalogue of Life**

https://www.catalogueoflife.org/data/taxon/4VWR2

This site lists vernacular names in various languages.

This site requires a datasetKey, which I don't have.

Related links:
- https://www.catalogueoflife.org/2022/08/15/archive-repository
- https://www.checklistbank.org
- https://api.checklistbank.org

**FishBase**

https://fishbase.de/summary/5021

The page lists the English name right below the species name.

The API returns 403 Forbidden.
