# Overview of the WildlifeReID-10k dataset

This notebook shows a basic analysis of the [WildlifeReID-10k](https://www.kaggle.com/datasets/wildlifedatasets/wildlifereid-10k) dataset. We first analyze the dataset and then show how to evaluate an already trained model. We begin by importing the required packages.

In [1]:
import os
print(os.getcwd())
!pip install git+https://github.com/WildlifeDatasets/wildlife-datasets@develop git+https://github.com/WildlifeDatasets/wildlife-tools

c:\Hub\DAT550-Animal-CLEF\Exploration
Collecting git+https://github.com/WildlifeDatasets/wildlife-datasets@develop
  Cloning https://github.com/WildlifeDatasets/wildlife-datasets (to revision develop) to c:\users\trade\appdata\local\temp\pip-req-build-hek_ycds
  Resolved https://github.com/WildlifeDatasets/wildlife-datasets to commit 959c6c01a8317ed4f162ebfdf4b5e63faad60228
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting git+https://github.com/WildlifeDatasets/wildlife-tools
  Cloning https://github.com/WildlifeDatasets/wildlife-tools to c:\users\trade\appdata\local\temp\pip-req-build-fs537d76
  Resolved https://github.com/WildlifeDatasets/wildlife-tools to commit 71aa4656d16afe4caae6d84af642bab81d




In [None]:
import os
print(os.environ.get("VIRTUAL_ENV"))

import copy
import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plt
from sklearn.metrics import balanced_accuracy_score
from wildlife_datasets.datasets import WildlifeReID10k
from wildlife_datasets.splits import analyze_split
from wildlife_datasets.metrics import BAKS, BAUS
from utils_wildlifereid10k.utils_wildlifereid10k import get_summary_species, compute_predictions, mean

C:\Hub\DAT550-Animal-CLEF\venv




In [None]:
# %pip install kagglehub
# import kagglehub
# path = kagglehub.dataset_download("wildlifedatasets/wildlifereid-10k")
# path2 = kagglehub.dataset_download("wildlifedatasets/wildlifereid-10k-features")
# print("Path to dataset files:", path)
# print("Path to feature files:", path2)

Now we load the dataset and show its metadata dataframe. It contains 140,488 images, each depicting an individual animal, and 10,772 individual animals in total.

In [4]:
root_data = 'C:/Users/trade/.cache/kagglehub/datasets/wildlifedatasets/wildlifereid-10k/versions/6'
dataset = WildlifeReID10k(root_data, check_files=False)
metadata = dataset.metadata
metadata

| | image_id | identity | path | date | orientation | species | split | dataset |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | AAUZebraFish_5 | images/AAUZebraFish/data/Vid2_0520_0000ef3f3f3... | | right | fish | train | AAUZebraFish |
| 1 | 1 | AAUZebraFish_3 | images/AAUZebraFish/data/Vid1_0062_0002c4ff6de... | | right | fish | train | AAUZebraFish |
| 2 | 2 | AAUZebraFish_4 | images/AAUZebraFish/data/Vid2_1065_000444965a5... | | right | fish | train | AAUZebraFish |
| 3 | 3 | AAUZebraFish_6 | images/AAUZebraFish/data/Vid2_0381_001446b650f... | | right | fish | train | AAUZebraFish |
| 4 | 4 | AAUZebraFish_2 | images/AAUZebraFish/data/Vid1_0582_00173b45939... | | left | fish | test | AAUZebraFish |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 140483 | 140483 | ZindiTurtleRecall_t_id_ip3jsrYo | images/ZindiTurtleRecall/images/ID_ZYTRP3VN_ID... | | left | sea turtle | train | ZindiTurtleRecall |
| 140484 | 140484 | ZindiTurtleRecall_t_id_o8HFaaCp | images/ZindiTurtleRecall/images/ID_ZZ04P34G_ID... | | | sea turtle | test | ZindiTurtleRecall |
| 140485 | 140485 | ZindiTurtleRecall_t_id_ruF8Nbxs | images/ZindiTurtleRecall/images/ID_ZZD2VBPA_ID... | | | sea turtle | train | ZindiTurtleRecall |
| 140486 | 140486 | ZindiTurtleRecall_t_id_m2JvEcsg | images/ZindiTurtleRecall/images/ID_ZZEGHRM5_ID... | | left | sea turtle | train | ZindiTurtleRecall |


We plot a 3×4 sample of the dataset. The depicted species, the image quality and the time of day clearly vary widely across the images. The sample contains three underwater photos and one night photo.

In [None]:
dataset.plot_grid(n_rows=3, n_cols=4, idx=np.arange(0, len(dataset), 10000)[::-1]);

We show a summary of the incorporated species. Most of the species are wild; domestic species are in the minority.

In [None]:
get_summary_species(metadata)
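
As a rough illustration of what such a per-species summary can look like, the sketch below counts images and individuals per species with a plain pandas `groupby`. It is not the `get_summary_species` helper, whose exact output we do not reproduce here. The dataframe `toy_metadata` and its values are made up for the example; only the column names `identity` and `species` are taken from the metadata shown above.

```python
import pandas as pd

# Hypothetical toy metadata; only the column names match the real dataframe.
toy_metadata = pd.DataFrame({
    "identity": ["z1", "z1", "z2", "t1", "t1", "t2", "t3"],
    "species": ["fish", "fish", "fish",
                "sea turtle", "sea turtle", "sea turtle", "sea turtle"],
})

# One row per species: number of images and number of distinct individuals.
summary = toy_metadata.groupby("species").agg(
    n_images=("identity", "size"),
    n_individuals=("identity", "nunique"),
)
print(summary)
```

The same aggregation should work on the real `metadata` dataframe, assuming it keeps the `species` and `identity` columns.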

## Splits for algorithm training

WildlifeReID-10k comes with a default split. The split is open-set, meaning that some animals appear only in the testing set and not in the training set. For such animals, the algorithm should predict that they are new. Evaluation is possible in both the open-set and the closed-set setting; in the latter, individuals that appear only in the testing set are ignored. The following summary shows that the training set contains 79.65% of the images, while test images of new individuals account for 10.68% and test images of known individuals for the remaining 9.67%. In other words, the open-set problem is evaluated on 20.35% of the dataset (10.68% + 9.67%), while the closed-set problem is evaluated on only 9.67%.

In [None]:
idx_train = np.where(metadata['split'] == 'train')[0]
idx_test = np.where(metadata['split'] == 'test')[0]
analyze_split(metadata, idx_train, idx_test)
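
To make the three fractions concrete, here is a minimal sketch that computes them by hand on a toy dataframe. It does not reproduce the output of `analyze_split`. The dataframe `toy` and its values are hypothetical; only the column names `identity` and `split` come from the metadata.

```python
import pandas as pd

# Hypothetical toy split: individual "c" appears in the testing set only.
toy = pd.DataFrame({
    "identity": ["a", "a", "a", "b", "b", "b", "a", "b", "c", "c"],
    "split": ["train"] * 6 + ["test"] * 4,
})

train_ids = set(toy.loc[toy["split"] == "train", "identity"])
is_test = toy["split"] == "test"
is_known = toy["identity"].isin(train_ids)

frac_train = (~is_test).mean()                # images in the training set
frac_test_new = (is_test & ~is_known).mean()  # test images of new individuals
frac_test_known = (is_test & is_known).mean() # test images of known individuals

print(f"train = {frac_train:.2%}")
print(f"test, new individuals = {frac_test_new:.2%}")
print(f"test, known individuals = {frac_test_known:.2%}")
```

In this toy case the open-set evaluation would use the two test fractions together, while the closed-set evaluation would use only the test images of known individuals.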

We can verify that 946 individuals appear in the testing set only.

In [None]:
identity = metadata['identity'].to_numpy()
identity_train = identity[metadata['split'] == 'train']
identity_test = identity[metadata['split'] == 'test']
identity_test_only = list(set(identity_test) - set(identity_train))
len(identity_test_only)

## Baseline performance

We assume that the user has already trained their algorithm on the training set. We used [MiewID](https://huggingface.co/conservationxlabs/miewid-msv3) to extract features. It is also possible to use features extracted by [MegaDescriptor](https://huggingface.co/BVRA/MegaDescriptor-L-384).

The predictions are computed dataset-wise, so we loop over the datasets. In each iteration, we load the corresponding features and make predictions using 1-NN on similarity scores. The similarity score is the cosine similarity, which depends only on the angle between the feature vectors. Whenever the similarity score is below the threshold `t`, we decide that the prediction is not sufficiently strong and predict `new_individual` instead. We compute the BAKS (balanced accuracy on known samples) and BAUS (balanced accuracy on unknown samples) metrics.
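
The thresholded 1-NN rule can be sketched on toy data as follows. This is a minimal illustration, not the `compute_predictions` helper used in the notebook. The function name `predict_1nn_open_set` and the toy features are our own assumptions.

```python
import numpy as np

def predict_1nn_open_set(query, database, labels_database, t,
                         new_label="new_individual"):
    """Hypothetical helper: 1-NN on cosine similarity with rejection below t."""
    # Normalize rows so that the dot product equals the cosine similarity.
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    d = database / np.linalg.norm(database, axis=1, keepdims=True)
    sim = q @ d.T
    # Nearest database sample for every query and its similarity score.
    idx = sim.argmax(axis=1)
    best = sim[np.arange(len(q)), idx]
    pred = np.asarray(labels_database, dtype=object)[idx]
    # Reject matches that are not sufficiently similar.
    pred[best < t] = new_label
    return pred, best

# Toy features: two database individuals and three queries.
database = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = ["A", "B"]
query = np.array([[0.9, 0.1], [0.1, 0.9], [1.0, 1.0]])

pred, best = predict_1nn_open_set(query, database, labels, t=0.8)
print(list(pred))
```

The third query lies at 45 degrees from both database vectors, so its best cosine similarity is about 0.71 and it falls below the threshold of 0.8.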

In [None]:
# Alternative feature roots when running on Kaggle:
# root_features = '/kaggle/input/wildlifereid-10k-features/features_miew'
# root_features = '/kaggle/input/wildlifereid-10k-features/features_mega'
root_features = 'C:/Users/trade/.cache/kagglehub/datasets/wildlifedatasets/wildlifereid-10k-features/versions/2'

# Thresholds: -inf (never predict new), 0.00, 0.01, ..., 1.00, +inf (always predict new)
step = 0.01
ts = [-np.inf] + list(np.round(np.arange(0, 1+step/10, step), 2)) + [np.inf]
new_individual = 'new_individual'

# baks[t][name] and baus[t][name] store the metrics per threshold and dataset
baks = {t: {} for t in ts}
baus = {t: {} for t in ts}
result = {}
for name, metadata_dataset in dataset.metadata.groupby('dataset'):
    print(name)
    features = np.load(f'{root_features}/features_{name}.npy')

    # Positional indices of training and testing images within this dataset
    idx_train = np.where(metadata_dataset['split'] == 'train')[0]
    idx_test = np.where(metadata_dataset['split'] == 'test')[0]

    idx_true, idx_pred, similarity = compute_predictions(features[idx_test], features[idx_train], return_score=True)
    # Map indices within the test/train subsets back to positions in the dataset
    idx_true = idx_test[idx_true]
    idx_pred = idx_train[idx_pred]
    # Keep only the first returned match per test image (1-NN)
    idx_pred = idx_pred[:,0]
    similarity = similarity[:,0]

    y_true = metadata_dataset['identity'].iloc[idx_true]
    y_pred_closed = metadata_dataset['identity'].iloc[idx_pred]

    # Individuals that appear in the testing set only
    identity_test_only = list(set(metadata_dataset['identity'].iloc[idx_test]) - set(metadata_dataset['identity'].iloc[idx_train]))

    for t in ts:
        # Replace weak matches by the new-individual label
        y_pred = copy.copy(y_pred_closed)
        with np.errstate(invalid='ignore'):
            y_pred[similarity < t] = new_individual
        baks[t][name] = BAKS(y_true, y_pred, identity_test_only)
        baus[t][name] = BAUS(y_true, y_pred, identity_test_only, new_individual)
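
To give an intuition for the two metrics, the sketch below implements one plausible reading of them: BAKS averages the per-individual accuracy over known individuals, and BAUS averages, over unknown individuals, the fraction of their images predicted as new. The functions `baks_sketch` and `baus_sketch` are our own illustration and may differ in details from `BAKS` and `BAUS` in `wildlife_datasets.metrics`.

```python
import numpy as np

def _mean_per_identity(y_true, hits):
    # Average the hit rate of each individual, so every individual weighs the same.
    return float(np.mean([hits[y_true == i].mean() for i in np.unique(y_true)]))

def baks_sketch(y_true, y_pred, identity_test_only):
    """Assumed definition: balanced accuracy restricted to known individuals."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    test_only = set(identity_test_only)
    known = np.array([y not in test_only for y in y_true], dtype=bool)
    if not known.any():
        return float("nan")
    return _mean_per_identity(y_true[known], y_true[known] == y_pred[known])

def baus_sketch(y_true, y_pred, identity_test_only, new_label):
    """Assumed definition: balanced rate of 'new' predictions on unknown individuals."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    test_only = set(identity_test_only)
    unknown = np.array([y in test_only for y in y_true], dtype=bool)
    if not unknown.any():
        return float("nan")
    return _mean_per_identity(y_true[unknown], y_pred[unknown] == new_label)

# Toy labels: "C" appears in the testing set only.
y_true = ["A", "A", "B", "C", "C"]
y_pred = ["A", "B", "B", "new_individual", "A"]
baks_toy = baks_sketch(y_true, y_pred, ["C"])
baus_toy = baus_sketch(y_true, y_pred, ["C"], "new_individual")
print(baks_toy, baus_toy)
```

In the toy example, individual A is recognized in one of two images and B in its only image, which gives a value of 0.75 on known samples; one of the two images of C is flagged as new, which gives 0.5 on unknown samples.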

We plot BAKS and BAUS averaged over datasets for different thresholds.

In [None]:
data_baks = np.array([mean(baks[t]) for t in ts])
data_baus = np.array([mean(baus[t]) for t in ts])

plt.scatter(data_baks, data_baus, color='blue', marker='+')
plt.xlabel('BAKS')
plt.ylabel('BAUS')
plt.xlim([0, 1.01])
plt.ylim([0, 1.01]);

Similarly, we plot the normalized accuracy, which is the geometric mean of BAKS and BAUS.

In [None]:
plt.plot(ts, np.sqrt(data_baks * data_baus))
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.xlabel('threshold')
plt.ylabel('normalized accuracy');
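
A curve like this can also be used to pick a threshold. The sketch below selects the threshold that maximizes the geometric mean on made-up values. The arrays `thresholds_toy`, `baks_toy` and `baus_toy` are hypothetical and are not results from this notebook.

```python
import numpy as np

# Hypothetical values: BAKS typically falls and BAUS rises as the threshold grows.
thresholds_toy = np.array([0.5, 0.6, 0.7, 0.8])
baks_toy = np.array([0.80, 0.75, 0.65, 0.40])
baus_toy = np.array([0.20, 0.45, 0.65, 0.85])

# Normalized accuracy is the geometric mean of the two metrics.
normalized_toy = np.sqrt(baks_toy * baus_toy)
best_threshold = thresholds_toy[np.argmax(normalized_toy)]
print(best_threshold, normalized_toy.max())
```

Note that a threshold selected this way on the testing set would give an optimistic estimate; a held-out validation split would be the safer place to tune it.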

Finally, we print the baseline results. For the closed-set setting, we select the threshold $t=-\infty$, for which no sample is predicted as a new individual. By the definition of BAKS, all new individuals in the testing set are ignored during evaluation. For the open-set setting, we arbitrarily select the threshold $t=0.7$.

In [None]:
print(f'Closed-set accuracy = {mean(baks[-np.inf])}')
print(f'Open-set normalized accuracy = {np.sqrt(mean(baks[0.7])*mean(baus[0.7]))}')