This is the process for creating the entire dataset, version 3.3 (which we used to train BioCLIP for the public release).
`download_data`:
- Run `bash scripts/download_data.sh` to download most of the metadata files.
`make_mapping`:
- Creates the sqlite database that maps from original files to tree of life IDs.
- Run `python scripts/evobio10m/make_mapping.py --tag v3.3 --workers 8`
- Can run on login nodes and should take several hours. If you want it to finish much faster, you can queue it on Slurm with more workers.
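To confirm the mapping database was built correctly, a quick query of its tables works well. A minimal sketch; the table names are read from `sqlite_master` rather than assumed:

```python
# Sanity-check the mapping database produced by make_mapping.py.
import sqlite3

db_path = "/fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/mapping.sqlite"
con = sqlite3.connect(db_path)

# List every table and its row count.
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
for table in tables:
    (count,) = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    print(f"{table}: {count} rows")

con.close()
```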
`make_splits`:
- Adds the splits table to the sqlite database: marks each image as belonging to either val or train, and then picks out 10% of the training images to use as an ablation study.
- Run `python scripts/evobio10m/make_splits.py --db /fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/mapping.sqlite --val-split 5 --train-small-split 10 --seed 17`
- This will run quickly on a login node.
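To sanity-check the split assignment, you can group the splits table by its flag columns. A rough sketch; the column names below are assumptions about the schema, so check them against what the table actually contains:

```python
import sqlite3

con = sqlite3.connect("/fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/mapping.sqlite")

# Print the schema of the splits table so you can see the real column names.
row = con.execute("SELECT sql FROM sqlite_master WHERE name = 'splits'").fetchone()
print(row[0] if row else "no table named 'splits' found")

# `is_val` and `is_train_small` are assumed column names; replace them with
# whatever the schema printed above actually uses.
for col in ("is_val", "is_train_small"):
    try:
        for value, count in con.execute(
                f"SELECT {col}, COUNT(*) FROM splits GROUP BY {col}"):
            print(col, value, count)
    except sqlite3.OperationalError as err:
        print(f"column {col} not found: {err}")

con.close()
```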
`make_metadata`:
- Creates all the metadata files that can be easily used by `make_wds.py`.
- Also makes a `predicted-catalog.csv` file that will closely mimic `catalog.csv` (described below). `predicted-catalog.csv` includes rows for the rare species which are not included in `catalog.csv`.
- See the ToL-EDA HF Repo for more information about these files.
- Run `python scripts/evobio10m/make_metadata.py --db /fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/mapping.sqlite`
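A quick way to inspect `predicted-catalog.csv` before running the taxa check is to load it with pandas. This sketch makes no assumptions about the schema beyond an optional species column:

```python
import pandas as pd

path = "/fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/predicted-catalog.csv"
df = pd.read_csv(path, low_memory=False)

print(df.shape)          # (rows, columns)
print(list(df.columns))  # see which taxonomic columns are present

# If there is a species-level column (the name "species" is an assumption),
# count how many unique species the predicted catalog covers.
if "species" in df.columns:
    print("unique species:", df["species"].nunique())
```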
`check_taxa`:
- This will check the predicted catalog file for any taxa issues. If there are major issues, fix them first.
- Run `python scripts/evobio10m/check_taxa.py /fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/predicted-catalog.csv`
`make-dataset-wds`:
- This actually creates the webdataset files by running `make_wds` for each of the splits.
- Run `sbatch slurm/make-dataset-wds.sh` on Pitzer.
- This runs `scripts/evobio10m/make_wds.py` for each of the splits using 32 workers.
- It takes a long time (around 6 hours) and requires lots of memory.
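To spot-check the output, one shard can be opened with the `webdataset` library. The shard path and the per-sample keys below are assumptions; adjust them to whatever `make_wds.py` actually writes:

```python
import webdataset as wds

# Assumed shard location and naming; point this at any real .tar produced above.
shard = "/fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/224x224/train/shard-000000.tar"
dataset = wds.WebDataset(shard).decode("pil")

# Print the key and available fields for the first few samples.
for i, sample in enumerate(dataset):
    print(sample["__key__"], sorted(sample.keys()))
    if i >= 4:
        break
```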
`check_wds`:
- Checks for bad shards and records them.
- Run `scripts/evobio10m/check_wds.py --shardlist SHARDS --workers 8 > logs/bad-shards.txt`
- Writes a list of bad shards to `logs/bad-shards.txt`.
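A small helper for summarizing that output, assuming `logs/bad-shards.txt` contains one shard path per line (the exact format is an assumption):

```python
from pathlib import Path

# Count and preview the shards flagged by check_wds.py.
lines = [ln.strip() for ln in Path("logs/bad-shards.txt").read_text().splitlines()
         if ln.strip()]
print(f"{len(lines)} bad shards recorded")
for shard in lines[:10]:
    print(shard)
```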
`make_catalog`:
- Generates the catalog of all images in the dataset, which includes information about their original data source and taxonomic record.
- Run `python scripts/evobio10m/make_catalog.py --dir /fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/224x224/ --workers 8 --batch-size 256 --tag v3.3 --db /fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/mapping.sqlite`
- Creates a file `catalog.csv` in `--dir`, which is a list of all names in the webdataset.
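Since `predicted-catalog.csv` should closely mimic `catalog.csv` apart from the rare species, comparing the two is a useful sanity check. A sketch, assuming both files share a `species` column; the catalog path below is the one passed to `check_taxa` in the next step, so adjust it if your copy lives under `--dir` instead:

```python
import pandas as pd

root = "/fs/ess/PAS2136/open_clip/data/evobio10m-v3.3"
catalog = pd.read_csv(f"{root}/catalog.csv", low_memory=False)
predicted = pd.read_csv(f"{root}/predicted-catalog.csv", low_memory=False)

print("catalog rows:", len(catalog), "predicted rows:", len(predicted))

# Species present only in the predicted catalog are candidates for the
# rare species that catalog.csv deliberately omits.
if "species" in catalog.columns and "species" in predicted.columns:
    only_predicted = set(predicted["species"]) - set(catalog["species"])
    print("species only in predicted catalog:", len(only_predicted))
```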
`check_taxa`:
- This will check the actual catalog file for any taxa issues.
- More information on this file can be found here.
- Run `python scripts/evobio10m/check_taxa.py /fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/catalog.csv`
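This is not the actual logic of `check_taxa.py`, just a rough illustration of the kind of issue such a check might flag, assuming the catalog has columns for the standard taxonomic ranks (the column names are assumptions):

```python
import pandas as pd

ranks = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]
df = pd.read_csv(
    "/fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/catalog.csv", low_memory=False)

# Report how many rows are missing a value at each rank that exists in the file.
for rank in [r for r in ranks if r in df.columns]:
    missing = df[rank].isna().sum()
    print(f"{rank}: {missing} rows with missing values")
```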
This process is buggy and doesn't always work: `make_wds.py` tries to re-write wds files that are corrupted, but it does not always succeed. `make_wds.py` also ignores images and species used in the rare species benchmark.
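If corrupted shards slip through, a plain `tarfile` pass is a generic fallback that catches shards which cannot even be read. This is independent of the project's own tooling, and the shard layout under the 224x224 directory is an assumption:

```python
import glob
import tarfile

# Try to open and iterate every shard; report any tar that fails outright.
shard_dir = "/fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/224x224"  # assumed layout
for path in sorted(glob.glob(f"{shard_dir}/**/*.tar", recursive=True)):
    try:
        with tarfile.open(path) as tar:
            for _ in tar:
                pass
    except (tarfile.TarError, EOFError) as err:
        print(f"bad shard: {path} ({err})")
```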