# Splits of images to model
This notebook presents how the splits for the models and the dataset versions are handled inside the plankton classifier pipeline.

## 1 Challenges and requirements of the split creation

To ensure the reproducibility of old splits while providing enough flexibility to try new splits, a dedicated module was written that fulfils the following points:

- Reproducibility of old splits
- Creation of new splits based on past data versions
- Consideration of an OOD data set
- Simple to implement extensions with own split methods


This notebook shows how the split processing is handled, how to use it and how to register a new split version.

### 1.1 Requirements

The notebook requires that all images are saved in their corresponding class folder, irrespective of the dataset version. Additionally, an overview data frame with the splits of ZooLake versions 1 and 2 must be provided. It can be created with the notebook `1_data_set_overview.ipynb`, which is included in the notebook directory.
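How the `sha256` column of the overview is produced is not shown in this notebook. The following is only a minimal sketch of how such rows could be collected, assuming a layout with one sub-folder per class; the function names `hash_file` and `list_images_by_class` are hypothetical and not part of `lit_ecology_classifier`.

```python
import hashlib
from pathlib import Path


def hash_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def list_images_by_class(root: Path) -> list[dict]:
    """Collect one row per image, assuming root/<class name>/<image file>."""
    rows = []
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for image_path in sorted(p for p in class_dir.iterdir() if p.is_file()):
            rows.append({
                "image": image_path.name,
                "class": class_dir.name,  # the folder name is taken as the label
                "sha256": hash_file(image_path),
            })
    return rows
```

The returned list of dictionaries can be passed directly to `pd.DataFrame` to obtain the `image`, `class` and `sha256` columns.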

### 1.2 Preparations

The preparations include navigating to the right directory level, installing the required package and loading the overview dataframe.


In [1]:
import logging


# Clear existing handlers
logger = logging.getLogger()
if logger.hasHandlers():
    logger.handlers.clear()

logger.setLevel(logging.DEBUG)

# Formatter and StreamHandler for the Notebook
handler = logging.StreamHandler()

# set format with timestamp, logger name, log level and message
handler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))
logger.addHandler(handler)

# Example log messages
logger.debug("Example debug message")
logger.info("Example info message")
logger.warning("Example warning message")

2024-12-20 11:10:35,337 - root - DEBUG - Example debug message
2024-12-20 11:10:35,338 - root - INFO - Example info message


In [2]:
# preparations
import os 

import pandas as pd

def check_current_work_dir():
    if not os.path.isfile("setup.py") or os.path.basename(os.getcwd()).endswith("notebooks"):
        print("Changing the current directory to the parent directory containing the setup.py file")

        # move one folder up
        os.chdir("..")
        print(f"New current directory: {os.getcwd()}, it will remain this working directory for the rest of the notebook")

    if not os.path.isfile("setup.py"):
        raise FileNotFoundError("setup.py not found in the current directory")

check_current_work_dir()

# Installation of the package:
# "%pip" makes it possible to install from within a notebook cell,
# and "." is used because setup.py is in the current directory
%pip install .


Changing the current directory to the parent directory containing the setup.py file
New current directory: c:\Repos\plankton_classifier, it will remain this working directory for the rest of the notebook
Processing c:\repos\plankton_classifier
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Building wheels for collected packages: lit_ecology_classifier
  Building wheel for lit_ecology_classifier (pyproject.toml): started
  Building wheel for lit_ecology_classifier (pyproject.toml): finished with status 'done'
  Created wheel for lit_ecology_classifier: filename=lit_ecology_classifier-2.0-py3-none-any.whl size=96269 sha256=8b7a8f7455f9971d47bcaa96c155fe367e921efef6bdfcf785a3846df667773c
  Stored in directory: 


[notice] A new release of pip is available: 23.2.1 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [3]:
# load the overview data
path = os.path.join("data", "interim",  "overview.csv")
df = pd.read_csv(filepath_or_buffer=path)
df.head()

Unnamed: 0,image,class,sha256,date,OOD_v2,version_1,version_2,train_v1,test_v1,val_v1,train_v2,test_v2,val_v2
0,SPC-EAWAG-0P5X-1570543372901157-3725350526242-...,aphanizomenon,6fb0b3fa4b36614703ee1abdcf8efba4cd936982ca5fb6...,2019-10-08 14:02:52+00:00,False,True,True,True,False,False,True,False,False
1,SPC-EAWAG-0P5X-1570543374882008-3725352526408-...,aphanizomenon,09e4aa12fdc992bbd840b7913f6f35394637bc2135c49f...,2019-10-08 14:02:54+00:00,False,True,True,True,False,False,False,True,False
2,SPC-EAWAG-0P5X-1589472012505862-10217420880920...,aphanizomenon,1ace5cdd5a68e8cd5fa703c92ac7c6e6b1d362b517132f...,2020-05-14 16:00:12+00:00,False,True,True,True,False,False,True,False,False
3,SPC-EAWAG-0P5X-1589472120505648-10217528889899...,aphanizomenon,f9a38d8538b1ac64383199851c61ad2f7f784e430086ea...,2020-05-14 16:02:00+00:00,False,True,True,True,False,False,True,False,False
4,SPC-EAWAG-0P5X-1589472215513831-10217623897796...,aphanizomenon,9cfb8f3f9d36cb50c32bedc72724092a7a01576ccb8529...,2020-05-14 16:03:35+00:00,False,True,True,False,False,True,True,False,False


In [4]:
# Transformation to a tidy format
df = df.filter(regex="train|test|val|image|class|sha256|OOD")

# Melt test, train, and val columns into one column per row and version
df_melted = df.filter(regex="v1|v2|image|class|sha256").melt(
    id_vars=['image', 'class', 'sha256'],
    var_name='split'
)

# Extract 'version' and 'split' from the 'split' column using vectorized string methods
df_melted["version"] = df_melted["split"].str.split('_').str[1]
df_melted["split"] = df_melted["split"].str.split('_').str[0]

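# Keep only rows where 'value' is True, i.e. the image belongs to that split;
# .copy() avoids a SettingWithCopyWarning in the assignments below
df_melted = df_melted[df_melted['value'] == 1].copy()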

# Replace version labels for consistency
df_melted["version"] = df_melted["version"].replace({"v1": "1", "v2": "2"})

# Drop the 'value' column as it's no longer needed
df_melted = df_melted.drop(columns="value")

# Display the first few rows of the transformed dataframe
df_melted.head()


Unnamed: 0,image,class,sha256,split,version
29640,SPC-EAWAG-0P5X-1624734366033156-29241721188723...,aphanizomenon,f918287e56745c04bda26e4e41c7410829460f1e00c9b8...,OOD,2
29641,SPC-EAWAG-0P5X-1624662155318571-29169511575069...,asplanchna,a09365d3b45b8d69976b0796f885d51c8763b7e035def9...,OOD,2
29642,SPC-EAWAG-0P5X-1624662157341137-29169513575235...,asplanchna,a4ca7d1c4daac22313245706347e3544fa90722273472c...,OOD,2
29643,SPC-EAWAG-0P5X-1624662354355777-29169710591612...,asplanchna,55df99360fdc74e9485846e4d836232df919e832e23a1e...,OOD,2
29644,SPC-EAWAG-0P5X-1624662437359090-29169793598512...,asplanchna,1610bc0981eb9b8543fa0a4e8b6afaa1a22788b6fdbcbc...,OOD,2



## 2 Split Overview

The split overview was created to keep track of the splits that have been used and to store information about them. It includes the following columns:

$$
\small
\begin{array}{c}
\textbf{Description of the split overview dataframe}\\
\newline
\begin{array}{|l|l|}
\hline
\textbf{Column name} & \textbf{Description} \\
\hline
\text{dataset version} & \text{Version of the dataset used for the split} \\
\text{OOD} & \text{OOD version included in the split} \\
\text{split strategy} & \text{Split strategy used} \\
\text{filter strategy} & \text{Filter strategy used} \\
\text{combined split hash} & \text{Hash value to identify and check the split} \\
\text{description} & \text{Place to add a short description of the split} \\
\hline
\end{array}
\end{array}
$$

The current state of the split overview can be found in the folder `data\interim\UsedSplits\`. The intention is to store the split overview in a database in the near future.



In [5]:
# loading of the split overview 
path = os.path.join("data", "interim", "UsedSplits", "split_overview.csv")

# column types used when reading the CSV file
types = {
    "dataset_version": "str",
    "OOD": "str",
    "split_strategy": "str",
    "combined_split_hash": "str",
    "description": "str"
}

split_overview = pd.read_csv(filepath_or_buffer=path, index_col=False, dtype=types)
split_overview.head()

Unnamed: 0,dataset_version,OOD,split_strategy,filter_strategy,combined_split_hash,description
0,1,,Unknown,PlanktonFilter,7ae8bedcd1f7b93380ada9d97df367f75e4ff22c0f9214...,Split used for Deep Learning Classification of...
1,2,OOD_2,Unknown,PlanktonFilter,7ac1342e84ca156574ef657e342945eeee398dc01c0563...,Split used for Producing Plankton Classifiers ...


## 3 Split hashes and combined split hash

In order not only to store information but also to verify whether reconstructed splits match the original ones, hashing is used again, this time based on the hashes of the images.

### 3.1 Split hashes

The split hashes are calculated from a sorted join of the image hashes within each unique value of the column `split`. This is shown below for the first ZooLake version.


In [6]:
# example of the split hashes 

from lit_ecology_classifier.helpers.hashing import HashGenerator

# filter to calculate the hashes for the first version
df_v1 = df_melted[df_melted["version"] == "1"]

# generate the hashes
hashes_v1 = HashGenerator().generate_hash_dict_from_split(df_v1, col_to_hash="sha256", group_by_col="split")

hashes_v1

{'test': '3dc0e5aadb042c37d8e52908b23c9b0af83e2497109e7e1c5a25d2b65c5e14be',
 'train': 'bc9bb5d05fbdd28547737c9953d10fa3f584e9a832dc7e2c75d1b6751f5a2024',
 'val': '1d24484776f9579e8bdd468c884dd4df15eb4429ea89d136fed636245d8c49e9'}

As the output shows, the values of the dictionary are the calculated hash values for each split. This makes it possible to check whether each reproduced split corresponds to the original one.
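The idea of a "sorted join" can be sketched with the standard library alone. This is not the implementation of `HashGenerator`: the function name `split_hashes`, the empty join separator and the UTF-8 encoding are assumptions, so the resulting values will not necessarily match the hashes shown above.

```python
import hashlib
from collections import defaultdict


def split_hashes(rows):
    """Map each split name to one hash over its sorted image hashes.

    rows: iterable of (split_name, image_sha256) pairs.
    """
    grouped = defaultdict(list)
    for split_name, image_hash in rows:
        grouped[split_name].append(image_hash)

    return {
        split_name: hashlib.sha256("".join(sorted(hashes)).encode("utf-8")).hexdigest()
        for split_name, hashes in grouped.items()
    }


rows = [("train", "bb"), ("test", "cc"), ("train", "aa")]
reordered = [("train", "aa"), ("train", "bb"), ("test", "cc")]

# Sorting before joining makes the result independent of the row order.
print(split_hashes(rows) == split_hashes(reordered))  # True
```

Because the image hashes are sorted first, two data frames that contain the same images per split yield the same split hashes, regardless of how their rows are ordered.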

### 3.2 Combined split hashes
One value per split has the disadvantage that it complicates a direct comparison of whether the entire split is identical. For this reason, the individual hash values are sorted and hashed again. The drawback is that the combined hash no longer reveals which split hash differs. Example:

In [7]:
combined_hash = HashGenerator().sha256_from_list(hashes_v1.values())    
combined_hash

'7ae8bedcd1f7b93380ada9d97df367f75e4ff22c0f9214c00e80c52845c5eaed'

Using the image hashes and the split hashes, it is in theory possible to verify down to the pixel values whether two splits are identical.
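The trade-off between the combined hash and the per-split hashes can be illustrated with a small sketch. The function `combined_hash` is a hypothetical stand-in for `sha256_from_list`, whose exact joining and encoding are not shown in this notebook, and the short strings are placeholders for real split hashes.

```python
import hashlib


def combined_hash(split_hash_values):
    """Hash the sorted per-split hashes into one value for the whole split."""
    joined = "".join(sorted(split_hash_values))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


original = {"train": "a1", "val": "b2", "test": "c3"}
reproduced = {"test": "c3", "train": "a1", "val": "b2"}
changed = {"train": "a1", "val": "b2", "test": "ff"}

print(combined_hash(original.values()) == combined_hash(reproduced.values()))  # True
print(combined_hash(original.values()) == combined_hash(changed.values()))     # False

# The combined hash only says that something differs; the per-split
# hashes are still needed to find out which split it is.
differing = [name for name in original if original[name] != changed[name]]
print(differing)  # ['test']
```

Note that in this sketch only the values are sorted, so exchanging the labels of two splits would leave the combined hash unchanged.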

## 4 SplitProcessor

In order to fulfil the necessary requirements to create and recreate old splits, the class `SplitProcessor` was implemented. The Split Processor provides the main functionalities to manage all aspects of the data splitting process, including the checking of existing splits, the creation of new splits and the execution of the image copying. 

The implementation of the split processor is object-oriented, employing polymorphism for the split strategy and inheritance of the `base_image_mover.py` functionalities for the copying of the images. 
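What polymorphism over the split strategy means in practice can be sketched as follows. All names here (`SplitStrategy`, `FirstHalfSplit`, `AlternatingSplit`, `STRATEGIES`, `resolve_strategy`, `run_split`) are hypothetical and only illustrate the pattern; they are not the classes of `lit_ecology_classifier`.

```python
from abc import ABC, abstractmethod


class SplitStrategy(ABC):
    """Hypothetical base class; stands in for the library's BaseSplitStrategy."""

    @abstractmethod
    def perform_split(self, items):
        """Return a dict mapping split names to lists of items."""


class FirstHalfSplit(SplitStrategy):
    def perform_split(self, items):
        middle = len(items) // 2
        return {"train": items[:middle], "test": items[middle:]}


class AlternatingSplit(SplitStrategy):
    def perform_split(self, items):
        return {"train": items[0::2], "test": items[1::2]}


# Hypothetical registry so that a strategy can be given by name or as an object
STRATEGIES = {"FirstHalf": FirstHalfSplit, "Alternating": AlternatingSplit}


def resolve_strategy(strategy):
    if isinstance(strategy, str):
        return STRATEGIES[strategy]()
    return strategy


def run_split(strategy, items):
    # The caller depends only on perform_split, not on the concrete class.
    return resolve_strategy(strategy).perform_split(items)


print(run_split("FirstHalf", [1, 2, 3, 4]))         # {'train': [1, 2], 'test': [3, 4]}
print(run_split(AlternatingSplit(), [1, 2, 3, 4]))  # {'train': [1, 3], 'test': [2, 4]}
```

With this design, adding a new split method only requires a new subclass; the code that runs the split stays untouched.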

At the moment, the created splits and the overview are stored in the folder `data\interim\UsedSplits\`. However, this is intended to be replaced by a database in the near future.

### 4.1 Attributes

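The following attributes of a `SplitProcessor` instance are used in this notebook:

- `split_df`: data frame with one row per image and the split it is assigned to
- `split_overview_df`: the split overview, including a row for a newly created split
- `class_map`: dictionary that maps each class name to an integer label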


### 4.2 How to use


#### 4.2.1 Recreation of splits 

To recreate a split that was used before, the split strategy, the dataset version and the OOD version need to be defined.

In [None]:
# recreation of a split
from lit_ecology_classifier.splitting.split import SplitProcessor

# os.path.join avoids backslash escape errors such as "\U" in Windows paths
split_processor = SplitProcessor(
    split_overview=split_overview,
    split_folder=os.path.join("data", "interim", "UsedSplits"),
    image_overview=os.path.join("data", "interim", "overview.csv"),
    split_strategy="Unknown",
    filter_strategy="PlanktonFilter",
)

In [None]:
split_processor.split_df

Unnamed: 0,image,class,sha256,split,version
0,SPC-EAWAG-0P5X-1570543372901157-3725350526242-...,aphanizomenon,6fb0b3fa4b36614703ee1abdcf8efba4cd936982ca5fb6...,train,1
1,SPC-EAWAG-0P5X-1570543374882008-3725352526408-...,aphanizomenon,09e4aa12fdc992bbd840b7913f6f35394637bc2135c49f...,train,1
2,SPC-EAWAG-0P5X-1589472012505862-10217420880920...,aphanizomenon,1ace5cdd5a68e8cd5fa703c92ac7c6e6b1d362b517132f...,train,1
3,SPC-EAWAG-0P5X-1589472120505648-10217528889899...,aphanizomenon,f9a38d8538b1ac64383199851c61ad2f7f784e430086ea...,train,1
4,SPC-EAWAG-0P5X-1589472588541825-10217996928805...,aphanizomenon,5ca3294d8df48501fd83731564c93442fedd1002c1b6f4...,train,1
...,...,...,...,...,...
18079,SPC-EAWAG-0P5X-1563195834608473-10101205289100...,keratella_cochlearis,225a67488e4bab2ece16cf9fc2c013524608601c7b2350...,val,1
18080,SPC-EAWAG-0P5X-1563196067636305-10101438308470...,keratella_cochlearis,64882fcbfbc4aff5835c4cd21e16a0201c1a12a51c4d98...,val,1
18081,SPC-EAWAG-0P5X-1575370922015129-8552825698030-...,keratella_cochlearis,240ef901638276c790bb83c791b48eb04ecd09b53bcc43...,val,1
18082,SPC-EAWAG-0P5X-1589537321913540-10282729292872...,keratella_cochlearis,f5edd8827d77807a30626961e984e3536f66fe30d2b0c7...,val,1


#### 4.2.2 Use of built-in split strategies

In [None]:
split_processor = SplitProcessor(
    split_strategy="Stratified",
    filter_strategy="PlanktonFilter",
    split_overview=split_overview,
    image_overview=os.path.join("data", "interim", "overview.csv"),
    filter_args={"dataset_version": "1"},
)


2024-12-16 11:09:53,216 - lit_ecology_classifier.splitting.split - DEBUG - Image overview column types: image        object
class        object
sha256       object
date         object
OOD_v2         bool
version_1      bool
version_2      bool
train_v1       bool
test_v1        bool
val_v1         bool
train_v2       bool
test_v2        bool
val_v2         bool
dtype: object
2024-12-16 11:09:53,216 - lit_ecology_classifier.splitting.split - DEBUG - Class name: Stratified
2024-12-16 11:09:53,216 - lit_ecology_classifier.splitting.split - DEBUG - Splitstrategie: Stratified
2024-12-16 11:09:53,216 - lit_ecology_classifier.splitting.split - DEBUG - Class name: PlanktonFilter
2024-12-16 11:09:53,216 - lit_ecology_classifier.splitting.split - DEBUG - Filterstrategie: PlanktonFilter
2024-12-16 11:09:53,216 - lit_ecology_classifier.splitting.split - DEBUG - Class name: PlanktonFilter
2024-12-16 11:09:53,216 - lit_ecology_classifier.splitting.split - DEBUG - Class name: Stratified
2024-12-16 11

In [None]:
split_processor.split_overview_df

Unnamed: 0,dataset_version,OOD,split_strategy,filter_strategy,combined_split_hash,description
0,1.0,,Unknown,PlanktonFilter,7ae8bedcd1f7b93380ada9d97df367f75e4ff22c0f9214...,Split used for Deep Learning Classification of...
1,2.0,OOD_2,Unknown,PlanktonFilter,7ac1342e84ca156574ef657e342945eeee398dc01c0563...,Split used for Producing Plankton Classifiers ...
2,,,Stratified,PlanktonFilter,eca71e33b09d856f6a8d2b9c2ecb8fbec2d335b640349c...,


In [None]:
split_processor.split_df

Unnamed: 0,image,class_map,split,sha256
0,SPC-EAWAG-0P5X-1570496498567211-3678476912301-...,27,train,e08bae075ae69dcde0bf83929c8fff5b3bac78e5a91e86...
1,SPC-EAWAG-0P5X-1602248934427170-6756627266623-...,34,train,1e1e28cad3f2b9f419ff564df6c3151fc3c4d3c0cd04ba...
2,SPC-EAWAG-0P5X-1657771518040069-62278380186028...,25,train,778882d5c100a75e52db1b40b34057f7cff661cca5665f...
3,SPC-EAWAG-0P5X-1659985235615601-64492066900435...,35,train,8262751c828fdb1450702399b2a7266037c8615079a34e...
4,SPC-EAWAG-0P5X-1656770957698475-61277833504801...,21,train,9e99e17ed206341d0bdf019667c6b9d1fcf1bb75338b99...
...,...,...,...,...
39437,SPC-EAWAG-0P5X-1637899349834501-42406507666460...,19,test,c14264dd6af57153fdba959e5e1eb53a0ae0cca4d7d52f...
39438,SPC-EAWAG-0P5X-1656907704320527-61414578280829...,28,test,8be1c0bab769b1f241dc461e916df1d963dfd1ec7c567e...
39439,SPC-EAWAG-0P5X-1590653329775022-11398720303487...,16,test,a02247a0645fd405168c8b239575d137ba489dd74c451f...
39440,SPC-EAWAG-0P5X-1659812915599649-64319749241096...,20,test,2098b9b552be5af76dc1fed870db37e338bfa096ffa48a...


In [None]:
split_processor.class_map

{'aphanizomenon': 1,
 'asplanchna': 2,
 'asterionella': 3,
 'bosmina': 4,
 'brachionus': 5,
 'ceratium': 6,
 'chaoborus': 7,
 'collotheca': 8,
 'conochilus': 9,
 'copepod_skins': 10,
 'cyclops': 11,
 'daphnia': 12,
 'daphnia_skins': 13,
 'diaphanosoma': 14,
 'diatom_chain': 15,
 'dinobryon': 16,
 'dirt': 17,
 'eudiaptomus': 18,
 'filament': 19,
 'fish': 20,
 'fragilaria': 21,
 'hydra': 22,
 'kellicottia': 23,
 'keratella_cochlearis': 24,
 'keratella_quadrata': 25,
 'leptodora': 26,
 'maybe_cyano': 27,
 'nauplius': 28,
 'paradileptus': 29,
 'polyarthra': 30,
 'rotifers': 31,
 'synchaeta': 32,
 'trichocerca': 33,
 'unknown': 34,
 'unknown_plankton': 35,
 'uroglena': 36}

#### 4.2.3 Usage with a custom split strategy

In [None]:
from sklearn.model_selection import train_test_split
from lit_ecology_classifier.splitting.split_strategies.base_split_strategy import BaseSplitStrategy

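class ExampleSplitProcessor(BaseSplitStrategy):
    """Example of a custom split strategy: a random 80/20 train/test split."""

    def perform_split(self, df, y_col="class_map"):
        X = df["image"]
        y = df[y_col]
        # fixed random_state so that the split is reproducible
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        # one entry per split, each holding the images and their labels
        return {
            "train": [X_train, y_train],
            "test": [X_test, y_test]
        }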
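A custom strategy could also produce a validation part. The following standalone sketch uses only the standard library and returns plain lists in the same `{"split name": [X, y]}` shape as `perform_split` above. The function name `three_way_split` and its parameters are hypothetical, and whether `SplitProcessor` accepts a `"val"` key from a custom strategy is an assumption that is not confirmed by this notebook.

```python
import random


def three_way_split(images, labels, val_size=0.1, test_size=0.2, seed=42):
    """Shuffle once, then cut into test, val and train parts."""
    indices = list(range(len(images)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible

    n_test = round(len(indices) * test_size)
    n_val = round(len(indices) * val_size)
    parts = {
        "test": indices[:n_test],
        "val": indices[n_test:n_test + n_val],
        "train": indices[n_test + n_val:],
    }
    # same shape as the return value of perform_split: [X, y] per split
    return {
        name: [[images[i] for i in idx], [labels[i] for i in idx]]
        for name, idx in parts.items()
    }


images = [f"img_{i}.jpeg" for i in range(10)]
labels = [i % 2 for i in range(10)]
result = three_way_split(images, labels)
print({name: len(x) for name, (x, y) in result.items()})  # {'test': 2, 'val': 1, 'train': 7}
```

Inside a strategy class, the body of `perform_split` would call such a function with `df["image"]` and `df[y_col]` converted to lists.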


In [None]:
split_processor = SplitProcessor(
    split_strategy=ExampleSplitProcessor(),
    filter_strategy="PlanktonFilter",
    split_overview=split_overview,
    image_overview=os.path.join("data", "interim", "overview.csv"),
    filter_args={"dataset_version": "2"},
)
    


2024-12-16 11:09:53,811 - lit_ecology_classifier.splitting.split - DEBUG - Image overview column types: image        object
class        object
sha256       object
date         object
OOD_v2         bool
version_1      bool
version_2      bool
train_v1       bool
test_v1        bool
val_v1         bool
train_v2       bool
test_v2        bool
val_v2         bool
dtype: object
2024-12-16 11:09:53,812 - lit_ecology_classifier.splitting.split - DEBUG - Class name: ExampleSplitProcessor
2024-12-16 11:09:53,812 - lit_ecology_classifier.splitting.split - DEBUG - Splitstrategie: ExampleSplitProcessor
2024-12-16 11:09:53,813 - lit_ecology_classifier.splitting.split - DEBUG - Class name: PlanktonFilter
2024-12-16 11:09:53,814 - lit_ecology_classifier.splitting.split - DEBUG - Filterstrategie: PlanktonFilter
2024-12-16 11:09:53,814 - lit_ecology_classifier.splitting.split - DEBUG - Class name: PlanktonFilter
2024-12-16 11:09:53,815 - lit_ecology_classifier.splitting.split - DEBUG - Class name: Ex

In [None]:
split_processor.split_overview_df

Unnamed: 0,dataset_version,OOD,split_strategy,filter_strategy,combined_split_hash,description
0,1.0,,Unknown,PlanktonFilter,7ae8bedcd1f7b93380ada9d97df367f75e4ff22c0f9214...,Split used for Deep Learning Classification of...
1,2.0,OOD_2,Unknown,PlanktonFilter,7ac1342e84ca156574ef657e342945eeee398dc01c0563...,Split used for Producing Plankton Classifiers ...
2,,,ExampleSplitProcessor,PlanktonFilter,4495df7f87c537adcdd6312a203a97aac567f5e9fdf971...,


## 5 Search for existing splits

In [None]:
split_processor.search_splits(filter_strategy="PlanktonFilter", split_strategy="Stratified")

2024-12-16 11:09:53,883 - lit_ecology_classifier.splitting.split - DEBUG - Class name: Stratified
2024-12-16 11:09:53,883 - lit_ecology_classifier.splitting.split - DEBUG - Splitstrategie: Stratified
2024-12-16 11:09:53,883 - lit_ecology_classifier.splitting.split - DEBUG - Class name: PlanktonFilter
2024-12-16 11:09:53,886 - lit_ecology_classifier.splitting.split - DEBUG - Filterstrategie: PlanktonFilter
2024-12-16 11:09:53,886 - lit_ecology_classifier.splitting.split - DEBUG - Class name: PlanktonFilter
2024-12-16 11:09:53,887 - lit_ecology_classifier.splitting.split - DEBUG - Class name: Stratified
2024-12-16 11:09:53,888 - lit_ecology_classifier.splitting.split - DEBUG - Existing split:Empty DataFrame
Columns: [dataset_version, OOD, split_strategy, filter_strategy, combined_split_hash, description]
Index: []
2024-12-16 11:09:53,889 - lit_ecology_classifier.splitting.split - INFO - No existing split found with the given strategies, creating new split.
2024-12-16 11:09:53,889 - lit_e

Unnamed: 0,image,class_map,split,sha256
0,SPC-EAWAG-0P5X-1570496498567211-3678476912301-...,27,train,e08bae075ae69dcde0bf83929c8fff5b3bac78e5a91e86...
1,SPC-EAWAG-0P5X-1602248934427170-6756627266623-...,34,train,1e1e28cad3f2b9f419ff564df6c3151fc3c4d3c0cd04ba...
2,SPC-EAWAG-0P5X-1657771518040069-62278380186028...,25,train,778882d5c100a75e52db1b40b34057f7cff661cca5665f...
3,SPC-EAWAG-0P5X-1659985235615601-64492066900435...,35,train,8262751c828fdb1450702399b2a7266037c8615079a34e...
4,SPC-EAWAG-0P5X-1656770957698475-61277833504801...,21,train,9e99e17ed206341d0bdf019667c6b9d1fcf1bb75338b99...
...,...,...,...,...
39437,SPC-EAWAG-0P5X-1637899349834501-42406507666460...,19,test,c14264dd6af57153fdba959e5e1eb53a0ae0cca4d7d52f...
39438,SPC-EAWAG-0P5X-1656907704320527-61414578280829...,28,test,8be1c0bab769b1f241dc461e916df1d963dfd1ec7c567e...
39439,SPC-EAWAG-0P5X-1590653329775022-11398720303487...,16,test,a02247a0645fd405168c8b239575d137ba489dd74c451f...
39440,SPC-EAWAG-0P5X-1659812915599649-64319749241096...,20,test,2098b9b552be5af76dc1fed870db37e338bfa096ffa48a...
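The log output above indicates that `search_splits` first looks for a matching row in the split overview and creates a new split when none is found. A minimal sketch of such a lookup over plain dictionaries is shown below; the function `find_splits` and the example rows are hypothetical and do not reproduce the library's implementation.

```python
def find_splits(overview_rows, **criteria):
    """Return the overview rows whose fields equal all given criteria."""
    return [
        row for row in overview_rows
        if all(row.get(key) == value for key, value in criteria.items())
    ]


overview_rows = [
    {"dataset_version": "1", "split_strategy": "Unknown", "filter_strategy": "PlanktonFilter"},
    {"dataset_version": "2", "split_strategy": "Unknown", "filter_strategy": "PlanktonFilter"},
]

matches = find_splits(overview_rows, split_strategy="Stratified", filter_strategy="PlanktonFilter")
if matches:
    print("reuse existing split")
else:
    print("no match, create a new split")  # this branch runs
```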
