# Introduction to the ZooLake dataset

## 1 Introduction
The ZooLake dataset is an open data project from Eawag that aims to automate the classification of currently 35 different lake plankton species using deep learning and other machine learning algorithms. The objective of the image classification is to enable the monitoring of the different plankton populations over time, as plankton are effective indicators of environmental change and ecosystem health in freshwater habitats.

The collection of plankton images is ongoing, with the aim of improving the classification by adding more images. The most recent images, which have not yet been manually labelled by a taxonomist, can be viewed on the [Aquascope](https://aquascope.ch/) webpage under the heading 'Latest Greifensee'.


### 1.1 Public dataset versions

As of 20 September 2024, two versions of the ZooLake dataset are publicly available on the Eawag open research data platform ERIC, alongside the classified Aquascope image packages:

- [ZooLake](https://opendata.eawag.ch/dataset/deep-learning-classification-of-zooplankton-from-lakes) is the initial version of the dataset, referenced by the paper 'Deep Learning Classification of Lake Zooplankton', with a total of 17,900 labelled images.

- [ZooLake2.0](https://data.eawag.ch/dataset/data-for-producing-plankton-classifiers-that-are-robust-to-dataset-shift) is the second version of the dataset, which includes more labelled data and introduces the *out-of-dataset (OOD)* data. The OOD data was used by C. Chen et al. (2024) in their research into producing plankton classifiers that are robust to dataset shift. That publication also mentions that the ZooLake2.0 images come with a 2-year gap to the first ZooLake version and comprise a total of 24,000 images.


- On the open ERIC platform, classified images are provided for all images collected since the beginning of the project in 2019. The packages can be found by entering [Aquascope](https://opendata.eawag.ch/dataset/?q=Aquascope) in the search bar of the platform.

## 2 Data structure of the labelled dataset

The image names and folder structure of the labelled dataset encode the class of each image, the time at which it was captured and further details about the image preprocessing.

### 2.1 Image name

The image name contains information about the camera, the capture time, the image size and further details about how the image was captured. For a more detailed description of the image name, please refer to the publication ***, where the image name and further details about the Aquascope are described.


$$
\small
\begin{array}{c}
\textbf{SPC-EAWAG-0P5X Image name example} 
\newline
\begin{array}{cccccccccccccccccc}
\text{SPC-EAWAG-0P5X-} & \text{1559498410-} & \text{191177-} & \text{6403834470-} & \text{952-} & \text{000009-} & \text{061-} & \text{1220-} & \text{2378-} & \text{52-} & \text{40} & \text{.jpeg} \\
\textit{1} &   \textit{2} &   \textit{3} &   \textit{4} &  \textit{5} &   \textit{6} &   \textit{7} &   \textit{8} &   \textit{9} &   \textit{10} &  \textit{11}
\end{array}
\end{array}
$$

$$
\small
\begin{array}{c}
\textbf{Data Field Descriptions for the SPC-EAWAG-0P5X Image} \\
\newline
\begin{array}{|c|c|}
\hline
\text{Nr.} & \text{Description} \\
\hline
\text{1} & \text{Camera name} \\
\text{2} & \text{Unixtime} \\
\text{3} & \text{Camera micros (microseconds) } \\
\text{4} & \text{Frame number} \\
\text{5} & \text{RoI number (Region of Interest)  } \\
\text{6} & \text{RoI left position } \\
\text{7} & \text{RoI top position } \\
\text{8} & \text{RoI width} \\
\text{9} & \text{RoI height } \\
\text{10} & \text{Image width} \\
\text{11} & \text{Image height} \\
\hline
\end{array}
\end{array}
$$
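
The field layout above can be turned into a small parser. The following is a minimal sketch, assuming every image name consists of the camera name followed by the ten hyphen-separated numeric fields listed in the table; the names `ImageNameFields` and `parse_image_name` are hypothetical and not part of any ZooLake tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass(frozen=True)
class ImageNameFields:
    """Fields 1 to 11 of an image name, in the order given in the table."""

    camera_name: str
    unixtime: int
    camera_micros: int
    frame_number: int
    roi_number: int
    roi_left: int
    roi_top: int
    roi_width: int
    roi_height: int
    image_width: int
    image_height: int

    @property
    def captured_at(self) -> datetime:
        """Capture time as a timezone-aware UTC datetime."""
        return datetime.fromtimestamp(self.unixtime, tz=timezone.utc)


def parse_image_name(file_name: str) -> ImageNameFields:
    """Split an image name into its fields (assumed layout, see lead-in)."""
    stem = Path(file_name).stem  # drop the .jpeg / .jpg extension
    # The camera name itself contains hyphens, so split from the right:
    # ten splits yield the camera name plus ten numeric fields.
    parts = stem.rsplit("-", 10)
    if len(parts) != 11:
        raise ValueError(f"unexpected image name layout: {file_name!r}")
    camera_name, *numbers = parts
    return ImageNameFields(camera_name, *(int(n) for n in numbers))


fields = parse_image_name(
    "SPC-EAWAG-0P5X-1559498410-191177-6403834470-952-000009-061-1220-2378-52-40.jpeg"
)
print(fields.camera_name, fields.captured_at.isoformat())
```

Splitting from the right keeps the parser independent of the number of hyphens in the camera name, so it should also work for other cameras as long as the ten numeric fields are present.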



### 2.1 Folder structure of the labelled dataset
The folder structure of the first version of ZooLake differs slightly from the usual computer vision layout of `class_folder/image_name.jpg`. The two versions are structured as follows:

```md

            ZooLake1 (original .Zip name: "data")                                
        zooplankton_0p5x/                            
        ├── aphanizomenon/                        
        │   ├── training_data/                      
        │   │   ├── image1.jpeg                        
        │   │   └── ...                           
        │   └── features.tsv                               
        ├── class2/                               
        │   ├── training_data/
        │   │   ├── image1.jpeg                                 
        │   │   └── ...                                                                        
        │   └── features.tsv                                           
        ├──  ...
        └──  zoolake_train_test_val_separated/ 
            ├── classes_ERIC.npy
            ├── Data.pickle
            ├── test_filenames.txt
            ├── train_filenames.txt
            └── val_filenames.txt        
        
            ZooLake2.0
        zooplankton2/
        ├── aphanizomenon/
        │   ├── image1.jpg
        │   └── ...
        ├── class2/
        │   ├── image1.jpg
        │   ├── image2.jpg
        │   └── ...
        ├── ... 
        └── Files_used_for_training_testing.pickle 
``` 

In both datasets, the 35 class folders, each holding the labelled images of one plankton class, carry the most important information. Version 1 differs from version 2 in that each class folder contains an intermediate folder named `training_data`, which holds the images, and a file named `features.tsv`. Note that, contrary to what its name suggests, the `training_data` folder says nothing about which images were used to train the model; it simply holds the images of that class. The `.tsv` files contain the data used for the classification model referenced in 'Deep Learning Classification of Lake Zooplankton' by Kyathanahally, S. et al. Subsequent versions are expected to follow the folder structure of ZooLake version 2.
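
To illustrate the difference between the two layouts, the sketch below collects `(image path, class name)` pairs from either version. It is a minimal example, assuming the folder trees shown above; the function name `list_labelled_images` and its `nested_training_data` flag are hypothetical.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpeg", ".jpg"}


def list_labelled_images(root, nested_training_data):
    """Return sorted (image_path, class_name) pairs for one dataset root.

    nested_training_data=True  -> ZooLake1 layout: class/training_data/image
    nested_training_data=False -> ZooLake2.0 layout: class/image
    """
    root = Path(root)
    pairs = []
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        image_dir = class_dir / "training_data" if nested_training_data else class_dir
        if not image_dir.is_dir():
            # Skips folders without training_data/, such as
            # zoolake_train_test_val_separated/ in ZooLake1.
            continue
        for image_path in sorted(image_dir.iterdir()):
            if image_path.suffix.lower() in IMAGE_SUFFIXES:
                pairs.append((image_path, class_dir.name))
    return pairs
```

A call such as `list_labelled_images("zooplankton_0p5x", nested_training_data=True)` would then cover ZooLake1, and `list_labelled_images("zooplankton2", nested_training_data=False)` ZooLake2.0. Files that are not images, such as `features.tsv` or the pickle file, are ignored by the suffix check.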

### 2.2 Files for the reconstruction of the train/val/test split

The train/validation/test split used in the publication linked to each dataset version can be reproduced from the split information shipped with that version. Both versions of ZooLake list each image name together with its split, but they store this information differently.
 

- ZooLake1: The image file names needed to recreate the split are listed in the folder `zoolake_train_test_val_separated` as text files, each named after its split.

- ZooLake2: The split is contained in the single pickle file `Files_used_for_training_testing.pickle` as a `pd.DataFrame` object. Based on the open source code by C. Chen, it is assumed that the image names of the training set are in the first column, those of the validation set in the second, and those of the test set in the third. This can be seen in the implementation of the function [SplitFromPickle](https://github.com/cchen07/Plankiformer_OOD/blob/main/utils_analysis/train_val_test_split.py) in `train_val_test_split.py` in the GitHub repository of the Plankiformer_OOD project.
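
The two storage formats described above could be read along the following lines. This is a rough sketch rather than the authors' code: it assumes that each ZooLake1 text file holds one entry per line, and that the ZooLake2 pickle is a DataFrame whose first three columns are train, validation and test, as stated above. Neither assumption has been checked against the files, and the function names are hypothetical.

```python
from pathlib import Path


def read_split_text_files(split_dir):
    """ZooLake1: read train/val/test lists from <split>_filenames.txt.

    Assumes one entry per non-empty line.
    """
    split_dir = Path(split_dir)
    splits = {}
    for split in ("train", "val", "test"):
        lines = (split_dir / f"{split}_filenames.txt").read_text().splitlines()
        splits[split] = [line.strip() for line in lines if line.strip()]
    return splits


def read_split_pickle(pickle_path):
    """ZooLake2: read the split from the pickled DataFrame.

    Assumed column order: train, validation, test.
    Only unpickle files from a source you trust.
    """
    # Imported here so that the text-file helper also runs without pandas.
    import pandas as pd

    frame = pd.read_pickle(pickle_path)
    names = ("train", "val", "test")
    # dropna() guards against padding in case the columns differ in length.
    return {
        name: frame.iloc[:, index].dropna().tolist()
        for index, name in enumerate(names)
    }
```

Both helpers return a dictionary with the keys `train`, `val` and `test`, so downstream code does not need to know which dataset version the split came from.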

