# Preprocessing Notebook

## Load Dataset

The data is available for download from GBIF at https://www.gbif.org/occurrence/download/0004691-251025141854904

The download covers South America from 2015 to 2020 and contains over 30 million records.

In [None]:
import pandas as pd

eod_data = pd.read_csv('../data/0004691-251025141854904.csv', delimiter='\t')
eod_data.head(n=2)

Unnamed: 0,gbifID,datasetKey,occurrenceID,kingdom,phylum,class,order,family,genus,species,...,identifiedBy,dateIdentified,license,rightsHolder,recordedBy,typeStatus,establishmentMeans,lastInterpreted,mediaType,issue
0,3177189382,4fa7b334-ce0d-4e88-aaae-2e0c138d049e,URN:catalog:CLO:EBIRD:OBS966912869,Animalia,Chordata,Aves,Passeriformes,Parulidae,Basileuterus,Basileuterus culicivorus,...,,,CC_BY_4_0,,obsr918297,,,2025-10-08T14:29:54.530Z,,CONTINENT_DERIVED_FROM_COORDINATES;TAXON_CONCE...
1,2045649326,4fa7b334-ce0d-4e88-aaae-2e0c138d049e,URN:catalog:CLO:EBIRD:OBS487895750,Animalia,Chordata,Aves,Accipitriformes,Cathartidae,Coragyps,Coragyps atratus,...,,,CC_BY_4_0,,obsr204697,,,2025-10-08T14:29:54.530Z,,CONTINENT_DERIVED_FROM_COORDINATES;TAXON_CONCE...


## Remove unnecessary columns


In [2]:
print(eod_data.columns.tolist())
print(len(eod_data.columns))

['gbifID', 'datasetKey', 'occurrenceID', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species', 'infraspecificEpithet', 'taxonRank', 'scientificName', 'verbatimScientificName', 'verbatimScientificNameAuthorship', 'countryCode', 'locality', 'stateProvince', 'occurrenceStatus', 'individualCount', 'publishingOrgKey', 'decimalLatitude', 'decimalLongitude', 'coordinateUncertaintyInMeters', 'coordinatePrecision', 'elevation', 'elevationAccuracy', 'depth', 'depthAccuracy', 'eventDate', 'day', 'month', 'year', 'taxonKey', 'speciesKey', 'basisOfRecord', 'institutionCode', 'collectionCode', 'catalogNumber', 'recordNumber', 'identifiedBy', 'dateIdentified', 'license', 'rightsHolder', 'recordedBy', 'typeStatus', 'establishmentMeans', 'lastInterpreted', 'mediaType', 'issue']
50


In [3]:
keep_cols = {
    'genus','species',
    'countryCode','locality','stateProvince',
    'individualCount',
    'decimalLatitude','decimalLongitude',
    'eventDate','recordedBy',
    }

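# Drop every column that is not listed in keep_cols
refined_eod_data = eod_data.drop(columns=[col for col in eod_data.columns if col not in keep_cols])
# Parse eventDate as datetime; values that cannot be parsed become NaT
refined_eod_data['eventDate'] = pd.to_datetime(refined_eod_data['eventDate'], errors='coerce')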
refined_eod_data.head(n=2)

Unnamed: 0,genus,species,countryCode,locality,stateProvince,individualCount,decimalLatitude,decimalLongitude,eventDate,recordedBy
0,Basileuterus,Basileuterus culicivorus,CO,Finca La Esmeralda - El Cairo - Valle del Cauc...,Valle del Cauca,1.0,4.733494,-76.20964,2020-07-04,obsr918297
1,Coragyps,Coragyps atratus,TT,Arena Forest,Couva-Tabaquite-Talparo,,10.569455,-61.248222,2017-03-29,obsr204697
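
If memory becomes a constraint, the same column selection could be applied while reading the file instead of after it. The following is a minimal sketch, not the approach used in this notebook: `load_selected` is a hypothetical helper, and the two sample rows are an in-memory stand-in for the real export (the `not-a-date` value is made up to show how `errors='coerce'` behaves).

```python
import io
import pandas as pd

# Columns to retain; mirrors keep_cols from the cell above.
keep_cols = [
    'genus', 'species',
    'countryCode', 'locality', 'stateProvince',
    'individualCount',
    'decimalLatitude', 'decimalLongitude',
    'eventDate', 'recordedBy',
]

def load_selected(source, columns):
    """Hypothetical helper: read a tab-separated export, keeping only `columns`."""
    df = pd.read_csv(source, sep='\t', usecols=columns)
    # Unparseable dates become NaT rather than raising an error
    df['eventDate'] = pd.to_datetime(df['eventDate'], errors='coerce')
    return df

# Tiny in-memory stand-in for the real file (illustration only).
header = ['gbifID'] + keep_cols + ['issue']
rows = [
    ['1', 'Coragyps', 'Coragyps atratus', 'TT', 'Arena Forest',
     'Couva-Tabaquite-Talparo', '', '10.569455', '-61.248222',
     '2017-03-29', 'obsr204697', ''],
    ['2', 'Egretta', 'Egretta caerulea', 'TT', 'Rahamut Trace',
     'Siparia', '2.0', '10.2009', '-61.481125',
     'not-a-date', 'obsr1080831', ''],
]
sample = io.StringIO('\n'.join('\t'.join(r) for r in [header] + rows) + '\n')

demo = load_selected(sample, keep_cols)
print(demo.columns.tolist())
print(demo['eventDate'].isna().sum(), 'unparseable date(s)')
```

With `usecols`, pandas never materializes the unused columns, which should lower peak memory on a file of this size.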


Note: EOD (the eBird Observation Dataset) does not provide unique checklist identifiers. We approximate one by grouping on eventDate, countryCode, locality, and stateProvince; together these roughly identify a unique sampling event for our use case.

In [8]:
checklist_cols = ['eventDate', 'countryCode', 'locality', 'stateProvince']
# Assign one ID per unique (eventDate, countryCode, locality, stateProvince) combination
refined_eod_data['checklist_id'] = refined_eod_data.groupby(checklist_cols).ngroup(ascending=True)
print(f'Number of checklists: {refined_eod_data["checklist_id"].nunique()}')

Number of checklists: 1032227
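
One caveat, stated as an assumption about the data rather than something verified here: by default `groupby` excludes rows whose key columns contain missing values, so any record lacking a locality, stateProvince, or parseable eventDate would not receive a valid group number. The float-valued `checklist_id` in the reloaded output further down is consistent with that. The sketch below, on made-up rows, shows how `dropna=False` would keep such records by treating a missing key as its own value.

```python
import pandas as pd

checklist_cols = ['eventDate', 'countryCode', 'locality', 'stateProvince']

# Made-up rows for illustration; the last two lack a stateProvince.
demo = pd.DataFrame({
    'eventDate': pd.to_datetime(
        ['2020-07-04', '2020-07-04', '2017-03-29', '2017-03-29']),
    'countryCode': ['CO', 'CO', 'TT', 'TT'],
    'locality': ['Site A', 'Site A', 'Site B', 'Site B'],
    'stateProvince': ['Valle del Cauca', 'Valle del Cauca', None, None],
})

# Count rows with at least one missing key column
n_missing = int(demo[checklist_cols].isna().any(axis=1).sum())

# dropna=False keeps rows with missing keys, so every row gets an integer ID
demo['checklist_id'] = demo.groupby(checklist_cols, dropna=False).ngroup()

print(f'Rows with a missing key: {n_missing}')
print(demo[['locality', 'stateProvince', 'checklist_id']])
```

Whether records with incomplete keys should be grouped together this way or dropped entirely is a modelling choice; checking `n_missing` on the real data first would show how much it matters.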


## Convert to parquet file format

CSV is slow to read and write for a dataset of this size. We convert to Parquet, a columnar format, so that later loads and saves are faster.

In [9]:
refined_eod_data.to_parquet('../data/0004691-251025141854904.parquet')

In [10]:
# Test loading the Parquet file; this should be much faster than loading the CSV
test_data = pd.read_parquet('../data/0004691-251025141854904.parquet')
test_data.head(n=5)

Unnamed: 0,genus,species,countryCode,locality,stateProvince,individualCount,decimalLatitude,decimalLongitude,eventDate,recordedBy,checklist_id
0,Basileuterus,Basileuterus culicivorus,CO,Finca La Esmeralda - El Cairo - Valle del Cauc...,Valle del Cauca,1.0,4.733494,-76.20964,2020-07-04,obsr918297,893563.0
1,Coragyps,Coragyps atratus,TT,Arena Forest,Couva-Tabaquite-Talparo,,10.569455,-61.248222,2017-03-29,obsr204697,202736.0
2,Egretta,Egretta caerulea,TT,Rahamut Trace,Siparia,2.0,10.2009,-61.481125,2020-09-06,obsr1080831,936373.0
3,Setophaga,Setophaga pitiayumi,AR,Cañada de Grass,Santa Fe,1.0,-29.918184,-60.310024,2020-05-01,obsr914504,843373.0
4,Zebrilus,Zebrilus undulatus,BR,Alta Floresta--rio Santa Helena,Mato Grosso,2.0,-9.926295,-56.308765,2016-02-20,obsr412230,86491.0
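
The expectation that Parquet loads faster than CSV can be checked rather than assumed. The following is a small sketch of one way to time a load; `timed` is a hypothetical helper, and the demo runs on a tiny in-memory table so that it is self-contained. No timings for the real files are claimed here.

```python
import io
import time
import pandas as pd

def timed(loader, *args, **kwargs):
    """Call loader(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = loader(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Self-contained demo on a tiny in-memory TSV (illustration only).
demo_df, demo_secs = timed(
    pd.read_csv,
    io.StringIO('genus\tspecies\nEgretta\tEgretta caerulea\n'),
    sep='\t',
)
print(f'Loaded {len(demo_df)} row(s) in {demo_secs:.4f} s')

# On the real files, the comparison would look like this, where csv_path and
# parquet_path stand for the paths used in the cells above:
# _, csv_secs = timed(pd.read_csv, csv_path, sep='\t')
# _, pq_secs = timed(pd.read_parquet, parquet_path)
# print(f'CSV: {csv_secs:.1f} s, Parquet: {pq_secs:.1f} s')
```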
