<a href="https://colab.research.google.com/github/SarahGhysels/SarahGhysels_thesis_2024/blob/Thesis/ThesisSarahGhysels_TrainTestSpit_threshold.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating a stratified test set

The train-test split is stratified on the two threshold classes, 'Keep' and 'Discard', which are derived from a DMY threshold.

## Importing packages

In [None]:
%pip install split-folders tqdm



In [None]:
import os
import shutil
import splitfolders
import pandas as pd
import re
import numpy as np

## Reading in data

In [None]:
#Linking google drive
from google.colab import drive
drive.mount('/content/drive',force_remount=True)

Mounted at /content/drive


In [None]:
classesbreeder_directory = '/content/drive/MyDrive/Sarah_Ghysels_Thesis/DatamodelNewClips/ClassesBreeder'
newclasses_directory ='/content/drive/MyDrive/Sarah_Ghysels_Thesis/DatamodelNewClips/ClassesThreshold'
test_directory = '/content/drive/MyDrive/Sarah_Ghysels_Thesis/DatamodelNewClips'

## Creating new classes based on threshold DMY

In [None]:
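#Determine the threshold: the DMY value that separates the top 10% of plants
Dataset_multigras = pd.read_csv("/content/drive/MyDrive/Sarah_Ghysels_Thesis/Datamodel/Multigras_data.csv", sep=';')
DMY = Dataset_multigras['DMY (kg/ha)']

#Convert to integers; entries that cannot be parsed get a negative sentinel,
#so they sort to the bottom and never end up in the top 10%
DMY_int = []
for value in DMY:
    try:
        DMY_int.append(int(value))
    except ValueError:
        DMY_int.append(-544)

DMY_int.sort()
#Position of the 90th percentile in the sorted list (the list holds 4224 values)
Top10percent = round(9 * len(DMY_int) / 10)
threshold = DMY_int[Top10percent]
print(threshold)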

6811


Plants with a DMY of at least 6811 kg/ha go to the 'Keep' class; plants with a DMY below 6811 kg/ha go to the 'Discard' class. 'Keep' refers to the plants the breeder would retain for further breeding, and 'Discard' to the plants that would be dropped from the breeding process.
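The threshold rule can be sketched as two small functions. This is a minimal sketch on made-up numbers, not the notebook's own code: `dmy_threshold`, `label_plant`, `toy_dmy`, and the sentinel default are hypothetical names and values introduced for illustration. As in the cell above, entries that cannot be parsed receive a negative sentinel so that they sort to the bottom, and the threshold is the value at position round(0.9 · n) of the sorted list.

```python
def dmy_threshold(values, keep_fraction=0.10, sentinel=-1):
    """Return the value that separates the top `keep_fraction` of entries."""
    cleaned = []
    for v in values:
        try:
            cleaned.append(int(v))
        except (ValueError, TypeError):
            # Unparseable entries sort to the bottom of the list.
            cleaned.append(sentinel)
    cleaned.sort()
    # Position of the (1 - keep_fraction) quantile, clamped to a valid index.
    cut = min(round((1 - keep_fraction) * len(cleaned)), len(cleaned) - 1)
    return cleaned[cut]


def label_plant(dmy, threshold):
    """'Keep' for values at or above the threshold, otherwise 'Discard'."""
    return "Keep" if dmy >= threshold else "Discard"


# Hypothetical toy data: ten plots, one of them without a usable DMY value.
toy_dmy = [1200, "n/a", 3400, 5600, 7100, 2500, 4300, 6900, 3100, 800]
toy_threshold = dmy_threshold(toy_dmy)
print(toy_threshold)
print(label_plant(6900, toy_threshold), label_plant(7100, toy_threshold))
```

Because the comparison is `>=`, a plant whose DMY equals the threshold is kept, which matches the class definition above.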

In [None]:
#Matching images with DMY values

#read in the class labels (subfolder names) and sort them
labels = os.listdir(classesbreeder_directory)
labels.sort()

#make sure both destination folders exist before copying
for new_class in ('Keep', 'Discard'):
  os.makedirs(os.path.join(newclasses_directory, new_class), exist_ok=True)

for name in labels:
  class_dir = os.path.join(classesbreeder_directory, name)
  for file in os.listdir(class_dir):
    #get the plot position (block, row, column) from the file name
    RowCol_string = re.findall(r'BLOK\d+R\d+P\d+', file)
    BlokRowCol = re.findall(r'\d+', RowCol_string[0])
    Blok = int(BlokRowCol[0])
    Row = int(BlokRowCol[1])
    Col = int(BlokRowCol[2])
    #row in the CSV: blocks of 44 rows x 32 columns, counted from 0
    index = ((Blok-1)*44*32 + (Row-1)*32 + Col) - 1
    #DMY values that cannot be parsed get -1 and end up in 'Discard'
    try:
      y = np.float32(DMY[index])
    except ValueError:
      y = -1
    new_class = 'Keep' if y >= threshold else 'Discard'
    source_path = os.path.join(class_dir, file)
    destination_path = os.path.join(newclasses_directory, new_class, file)
    shutil.copy(source_path, destination_path)
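
The mapping from a file name to a row of the DMY table can be isolated in a helper, which makes the index arithmetic easier to check by hand. The sketch below reuses the `BLOK…R…P…` pattern and the 44 × 32 layout from the cell above; the function name `plot_index`, the constant names, and the example file names are hypothetical.

```python
import re

ROWS_PER_BLOCK = 44   # taken from the index formula in the cell above
COLS_PER_ROW = 32     # taken from the index formula in the cell above
PLOT_PATTERN = re.compile(r"BLOK(\d+)R(\d+)P(\d+)")


def plot_index(filename):
    """Return the zero-based row of the DMY table for an image file name."""
    match = PLOT_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"no BLOK/R/P position found in {filename!r}")
    blok, row, col = (int(group) for group in match.groups())
    # Blocks, rows and columns are numbered from 1 in the file names.
    return ((blok - 1) * ROWS_PER_BLOCK * COLS_PER_ROW
            + (row - 1) * COLS_PER_ROW
            + col) - 1


# Hypothetical file names, for illustration only.
print(plot_index("clip_BLOK1R1P1.jpg"))   # first plot of the first block
print(plot_index("clip_BLOK2R3P5.jpg"))   # 1*1408 + 2*32 + 5 - 1
```

Raising a `ValueError` for a file name without a position makes a stray file in the class folders fail with a readable message rather than an `IndexError` on an empty match list.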

In [None]:
#Checking that roughly 10% of the images ended up in the Keep folder
keep_path = os.path.join(newclasses_directory, 'Keep')
discard_path = os.path.join(newclasses_directory, 'Discard')
number_keep = len(os.listdir(keep_path))
print(number_keep)
number_discard = len(os.listdir(discard_path))
print(number_discard)
#correct: 414 of 4192 images (9.9%) are in Keep

414
3778
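
The same check can be written as a reusable helper that counts the files in every class folder and reports the share of one class. This is a sketch only: `count_files_per_class` and `class_fraction` are hypothetical helper names, and the demonstration runs on a temporary folder with placeholder files instead of the Drive folders.

```python
import os
import tempfile


def count_files_per_class(root):
    """Map each subfolder of `root` to the number of entries it contains."""
    return {
        name: len(os.listdir(os.path.join(root, name)))
        for name in sorted(os.listdir(root))
        if os.path.isdir(os.path.join(root, name))
    }


def class_fraction(counts, class_name):
    """Share of `class_name` among all counted files."""
    total = sum(counts.values())
    return counts[class_name] / total if total else 0.0


# Demonstration on a throwaway folder with empty placeholder files.
with tempfile.TemporaryDirectory() as demo_root:
    for class_name, n_files in (("Keep", 1), ("Discard", 3)):
        os.makedirs(os.path.join(demo_root, class_name))
        for i in range(n_files):
            open(os.path.join(demo_root, class_name, f"img_{i}.jpg"), "w").close()
    demo_counts = count_files_per_class(demo_root)
print(demo_counts)

# Applied to the counts printed above: 414 Keep and 3778 Discard images.
keep_share = class_fraction({"Keep": 414, "Discard": 3778}, "Keep")
print(f"{keep_share:.1%}")
```

The Keep share is slightly below 10% because fewer images (4192) than DMY values (4224) are available.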


In [None]:
#Creating a stratified split in train and test set
splitfolders.ratio(newclasses_directory, output=test_directory, seed=1337, ratio=(0.9, 0.1), group_prefix=None, move=False)

Copying files: 4192 files [01:58, 35.25 files/s]


In [None]:
#checking how many images from each class are in the two folders
trainkeep =  os.path.join(test_directory,'train', 'Keep')
traindiscard =  os.path.join(test_directory,'train', 'Discard')
testkeep =  os.path.join(test_directory,'val', 'Keep')
testdiscard =  os.path.join(test_directory,'val', 'Discard')
lst = os.listdir(trainkeep)
number_files = len(lst)
print(number_files)
lst = os.listdir(traindiscard)
number_files = len(lst)
print(number_files)
lst = os.listdir(testkeep)
number_files = len(lst)
print(number_files)
lst = os.listdir(testdiscard)
number_files = len(lst)
print(number_files) #all correct

372
3400
42
378


With only 42 images, the Keep class of the test set is very small, so test metrics for this class will be coarse: a single misclassified Keep image changes the Keep recall by about 2.4 percentage points. The 90/10 split was kept nevertheless, because the 372 Keep images in the training set are also few and will be split further into a training and a validation set.
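
The class counts above can be reproduced with a few lines of arithmetic. The sketch assumes that each class is split separately and that the training count is the ratio times the class size, rounded down; this assumption matches the printed counts but is not taken from the split-folders documentation. `split_counts` is a hypothetical helper name.

```python
import math


def split_counts(n_images, train_ratio=0.9):
    """Return (train, test) counts for one class under a per-class split."""
    n_train = math.floor(train_ratio * n_images)  # assumption: rounded down
    return n_train, n_images - n_train


class_sizes = {"Keep": 414, "Discard": 3778}
splits = {name: split_counts(n) for name, n in class_sizes.items()}
print(splits)

# Share of Keep images in each subset: stratification keeps both near 10%.
train_total = sum(train for train, _ in splits.values())
test_total = sum(test for _, test in splits.values())
print(round(splits["Keep"][0] / train_total, 4))
print(round(splits["Keep"][1] / test_total, 4))

# Change in Keep recall caused by one misclassified Keep image in the test set.
recall_step = 1 / splits["Keep"][1]
print(round(recall_step, 4))
```

Both subsets keep a Keep share close to the 9.9% of the full dataset, which is the purpose of the stratified split.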

In [None]:
#renaming val to Test and train to Train
testdir_old =  os.path.join(test_directory,'val')
testdir_new =  os.path.join(test_directory,'Test')
os.rename(testdir_old, testdir_new)
traindir_old =  os.path.join(test_directory,'train')
traindir_new =  os.path.join(test_directory,'Train')
os.rename(traindir_old, traindir_new)