## Data Prep

This notebook parses the GBIF occurrence data and downloads the media files it links to, in preparation for training a multi-modal embedding model.

In [39]:
import pandas as pd
import requests
import json
import uuid

First, we load the JSON file.



In [14]:
with open("data/sample_occurences.json", "rb") as f:
    jdata = f.read()
occurences = json.loads(jdata)
occurences

[{'offset': 0,
  'limit': 10,
  'endOfRecords': False,
  'count': 23277,
  'results': [{'key': 4510345615,
    'datasetKey': '50c9509d-22c7-4a22-a47d-8c48425ef4a7',
    'publishingOrgKey': '28eb1a3f-1c15-4a95-931a-4af90ecb574d',
    'installationKey': '997448a8-f762-11e1-a439-00145eb45e9a',
    'hostingOrganizationKey': '28eb1a3f-1c15-4a95-931a-4af90ecb574d',
    'publishingCountry': 'US',
    'protocol': 'DWC_ARCHIVE',
    'lastCrawled': '2024-05-20T03:40:30.646+00:00',
    'lastParsed': '2024-05-20T17:17:07.278+00:00',
    'crawlId': 459,
    'extensions': {'http://rs.gbif.org/terms/1.0/Multimedia': [{'http://purl.org/dc/terms/format': 'image/jpeg',
       'http://rs.tdwg.org/dwc/terms/catalogNumber': '344229310',
       'http://purl.org/dc/terms/type': 'StillImage',
       'http://purl.org/dc/terms/publisher': 'iNaturalist',
       'http://purl.org/dc/terms/creator': 'Colin Croft',
       'http://purl.org/dc/terms/license': 'http://creativecommons.org/licenses/by/4.0/',
       'http

In [30]:
print("First level of keys:")
print(occurences[0].keys())
print("\n")


occ = occurences[0]
print("Length of Results:")
print(len(occ["results"]))

res = occ["results"][0]
print("Result Keys:")
print(res.keys())

print("\n")
print("Species Id, Name, Generic Name")
print(res["speciesKey"], res["acceptedScientificName"], res["genericName"])
print("\n")

print("Media Length")
print(len(res["media"]))
med = res["media"][0]

First level of keys:
dict_keys(['offset', 'limit', 'endOfRecords', 'count', 'results', 'facets'])


Length of Results:
10
Result Keys:
dict_keys(['key', 'datasetKey', 'publishingOrgKey', 'installationKey', 'hostingOrganizationKey', 'publishingCountry', 'protocol', 'lastCrawled', 'lastParsed', 'crawlId', 'extensions', 'basisOfRecord', 'occurrenceStatus', 'taxonKey', 'kingdomKey', 'phylumKey', 'classKey', 'orderKey', 'familyKey', 'genusKey', 'speciesKey', 'acceptedTaxonKey', 'scientificName', 'acceptedScientificName', 'kingdom', 'phylum', 'order', 'family', 'genus', 'species', 'genericName', 'specificEpithet', 'taxonRank', 'taxonomicStatus', 'iucnRedListCategory', 'dateIdentified', 'decimalLatitude', 'decimalLongitude', 'coordinateUncertaintyInMeters', 'continent', 'stateProvince', 'gadm', 'year', 'month', 'day', 'eventDate', 'startDayOfYear', 'endDayOfYear', 'issues', 'modified', 'lastInterpreted', 'references', 'license', 'isSequenced', 'identifiers', 'media', 'facts', 'relations', 'is
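
The structure explored above is a list of pages, each holding a `results` list of records. Below is a minimal sketch of walking that structure, assuming every page is shaped like the one printed here. `iter_records` and `summarize` are hypothetical helper names, the sample pages are made-up data rather than GBIF output, and the use of `dict.get` rests on the assumption that some records may lack a field.

```python
def iter_records(pages):
    """Yield every record from every page of results."""
    for page in pages:
        # A page without a "results" key contributes nothing
        for record in page.get("results", []):
            yield record


def summarize(record):
    """Pick out the fields of interest; missing keys become None."""
    return {
        "speciesKey": record.get("speciesKey"),
        "scientificName": record.get("acceptedScientificName"),
        "genericName": record.get("genericName"),
        "mediaCount": len(record.get("media", [])),
    }


# Made-up sample shaped like the pages above (not real GBIF data)
sample_pages = [
    {"results": [{"speciesKey": 1,
                  "acceptedScientificName": "Aus bus L.",
                  "genericName": "Aus",
                  "media": [{}, {}]}]},
    {"results": [{"speciesKey": 2, "genericName": "Cus"}]},
]

summaries = [summarize(r) for r in iter_records(sample_pages)]
```

Using a generator keeps the page and record loops in one place, so later steps can iterate over records without repeating the nesting.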

## Scripting data

For each record, we are going to pull the taxonomy ID, the generic name, and the pictures.

First, let's define a function that takes a media dict, downloads the file, and returns the name of the saved file.

In [49]:
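def download_media(media_dict):
    """Download the picture behind a media link (png, jpeg, etc.) to the data folder.

    Keyword arguments:
    media_dict -- a single media object with multiple links from the GBIF dataset

    Returns the name of the file saved under data/images/.
    """
    url = media_dict["identifier"]
    # Take the extension from the last dot-separated piece of the URL
    extension = url.split(".")[-1]

    # Save under a random UUID so file names never collide
    new_file_name = str(uuid.uuid4()) + "." + extension
    file_path = "data/images/" + new_file_name

    # The 30-second timeout is an arbitrary choice to avoid hanging forever
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail on HTTP errors instead of saving an error page

    with open(file_path, "wb") as f:
        f.write(response.content)
    return new_file_name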
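
Splitting the URL on `.` works when the link ends in a plain file name, but it can pick up extra text if the link carries a query string or has no extension at all. Below is a minimal sketch of a more defensive approach using only the standard library. `extension_from_url` is a hypothetical helper, the example URLs are made up, and the `"jpg"` fallback is an assumption, not something the dataset guarantees.

```python
import os
from urllib.parse import urlparse


def extension_from_url(url, default="jpg"):
    """Return the file extension of a URL's path, ignoring any query string."""
    path = urlparse(url).path
    ext = os.path.splitext(path)[1]  # e.g. ".jpeg", or "" when there is none
    # Fall back to the (assumed) default when the path has no extension
    return ext.lstrip(".").lower() or default


# Made-up URLs for illustration
with_query = extension_from_url("https://example.org/photos/1/original.jpeg?1700000000")
upper_case = extension_from_url("https://example.org/photos/1/original.PNG")
no_extension = extension_from_url("https://example.org/media/123")
```

A helper like this could replace the `split(".")` step in `download_media` if the links turn out to be less regular than the sample.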



Now we can parse all the records and save them as a CSV, with each record's picture file names stored as a list.

In [52]:
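def parse_records(occurences):
    """Flatten every record in every page of results into one DataFrame and save it as a CSV."""
    # Collect rows across all pages; one dict per record
    database = []

    for occurence in occurences:
        for record in occurence["results"]:
            record_dict = {}

            record_dict["id"] = record["key"]
            record_dict["speciesKey"] = record["speciesKey"]
            record_dict["scienceName"] = record["acceptedScientificName"]
            record_dict["simpleName"] = record["genericName"]
            record_dict["dataset"] = record["datasetName"]

            # Download each picture and keep the saved file names
            media_files = []
            for media in record["media"]:
                file_name = download_media(media)
                media_files.append(file_name)

            record_dict["media"] = media_files
            database.append(record_dict)

    # One row per record; the media column holds a list of file names
    df = pd.DataFrame(database)
    df.to_csv("data/input.csv")

    return df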
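
One thing to watch when a list is stored in a CSV cell: it is written out as text, so it comes back as a string when the file is read. Below is a minimal sketch of one way around that, assuming it is acceptable to encode the file names as a JSON string. It uses the standard-library `csv` module and an in-memory buffer instead of `data/input.csv`, and the row is made-up data.

```python
import csv
import io
import json

# Made-up row shaped like record_dict (not real GBIF data)
rows = [{"id": 1, "simpleName": "Aus", "media": ["a.jpg", "b.jpg"]}]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "simpleName", "media"])
writer.writeheader()
for row in rows:
    # Encode the list as JSON text so it can be decoded unambiguously later
    writer.writerow({**row, "media": json.dumps(row["media"])})

# Read the CSV back and decode the media column into a list again
buffer.seek(0)
restored = []
for row in csv.DictReader(buffer):
    row["media"] = json.loads(row["media"])
    restored.append(row)
```

The same idea should carry over to a DataFrame by applying `json.dumps` to the media column before saving and `json.loads` after loading.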




In [53]:
parse_records(occurences)