# Building a Moderately Sized, Balanced Dataset for a Basic CNN Model


### The original dataset can be downloaded from the AI Hub website: https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=data&dataSetSn=71667

### Extracting 5,000 images per category for training and 500 images per category for validation from the Larvae (유충) dataset:

#### Normal (정상), Varroa Mite (응애), Foulbrood (부저병), Chalkbrood (석고병)
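
The four classes above correspond to the Korean folder names used in the extraction code in this notebook. A small lookup table can make later plots and reports readable in English. This is only a sketch: the names `FOLDER_TO_LABEL` and `english_label` are hypothetical and do not appear in the original notebook.

```python
# Hypothetical helper: maps the dataset's folder names to English class names.
FOLDER_TO_LABEL = {
    '유충_정상': 'Normal',
    '유충_응애': 'Varroa Mite',
    '유충_부저병': 'Foulbrood',
    '유충_석고병': 'Chalkbrood',
}


def english_label(folder_name):
    """Return the English class name for a category folder name."""
    return FOLDER_TO_LABEL[folder_name]
```

Keeping the folder names as dictionary keys means the files on disk never need to be renamed.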


### Reason for choosing the Larvae (유충) dataset for CNN classification:

#### The Larvae (유충) dataset has four distinct categories (Normal, Varroa Mite, Foulbrood, Chalkbrood), which makes it a richer classification challenge than the Adult Bees (성충) dataset. Foulbrood and Chalkbrood also show visible symptoms, so the model is easier to train and its predictions are easier to interpret, which gives clearer initial results. Starting with this more complex task helps build a more robust model and lays a foundation for Explainable AI and for comparative analysis with more advanced models such as YOLOv8 and EfficientDet.

In [1]:
import os
import shutil

In [3]:
# Directories for source and destination

source_train_dir = r'C:\Users\user\Documents\EcoUp\Honeybee Data\OpenData\Data\Training\Source Data\TS\유충'  # Path to the Larvae (유충) training folders
source_val_dir = r'C:\Users\user\Documents\EcoUp\Honeybee Data\OpenData\Data\Validation\Source Data\VS\유충'  # Path to the Larvae (유충) validation folders

output_train_dir = r'C:\Users\user\Documents\EcoUp\CNN Model\Training'  # Path where the training subset will be saved
output_val_dir = r'C:\Users\user\Documents\EcoUp\CNN Model\Validation'  # Path where the validation subset will be saved

# Category folders to include in the subset: Foulbrood, Chalkbrood, Varroa Mite, Normal
categories = ['유충_부저병', '유충_석고병', '유충_응애', '유충_정상']

# Number of images per category for the subset
num_train_images_per_category = 5000  # Extract 5000 images per category for training
num_val_images_per_category = 500  # Extract 500 images per category for validation

# Function to create the subset of images
def create_image_subset(source_dir, output_dir, categories, num_images_per_category):
    os.makedirs(output_dir, exist_ok=True)

    for category in categories:
        category_path = os.path.join(source_dir, category)
        category_output_path = os.path.join(output_dir, category)
        os.makedirs(category_output_path, exist_ok=True)

        # List all subfolders (each holds an image), sorted so the selection is reproducible
        image_folders = sorted(f for f in os.listdir(category_path) if os.path.isdir(os.path.join(category_path, f)))

        # Copy one image per subfolder into the category folder until the quota is met
        images_copied = 0
        for image_folder in image_folders:
            # Stop before copying once the quota is reached (also handles a quota of 0)
            if images_copied >= num_images_per_category:
                break

            image_folder_path = os.path.join(category_path, image_folder)
            # Match the extension case-insensitively so files ending in .JPG are not skipped
            image_files = sorted(f for f in os.listdir(image_folder_path) if f.lower().endswith('.jpg'))

            # Copy the first image found in each folder
            if image_files:
                img_file = image_files[0]
                img_src_path = os.path.join(image_folder_path, img_file)
                img_dst_path = os.path.join(category_output_path, img_file)
                shutil.copy(img_src_path, img_dst_path)
                images_copied += 1

        print(f'Copied {images_copied} images from {category}')

# Create training subset (5,000 images per category)
create_image_subset(source_train_dir, output_train_dir, categories, num_train_images_per_category)

# Create validation subset (500 images per category)
create_image_subset(source_val_dir, output_val_dir, categories, num_val_images_per_category)


Copied 5000 images from 유충_부저병
Copied 5000 images from 유충_석고병
Copied 5000 images from 유충_응애
Copied 5000 images from 유충_정상
Copied 500 images from 유충_부저병
Copied 500 images from 유충_석고병
Copied 500 images from 유충_응애
Copied 500 images from 유충_정상
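
The counts printed above come from a counter inside the copy loop, not from the files on disk. If two source subfolders ever contained images with the same filename, the later copy would overwrite the earlier one while the counter still increased. A quick check of the output folders can confirm that the subset really is balanced. The sketch below is an assumption-laden illustration: `count_images_per_category`, `is_balanced`, and `IMAGE_EXTENSIONS` are hypothetical names, not part of the original notebook.

```python
import os

# Hypothetical constant: extensions treated as images when counting.
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')


def count_images_per_category(root_dir, categories):
    """Return {category: number of image files directly inside root_dir/category}."""
    counts = {}
    for category in categories:
        category_dir = os.path.join(root_dir, category)
        if not os.path.isdir(category_dir):
            # A missing folder counts as zero images.
            counts[category] = 0
            continue
        counts[category] = sum(
            1 for name in os.listdir(category_dir)
            if name.lower().endswith(IMAGE_EXTENSIONS)
        )
    return counts


def is_balanced(counts, expected):
    """True when every category holds exactly `expected` images."""
    return all(n == expected for n in counts.values())
```

In this notebook it could be called as `is_balanced(count_images_per_category(output_train_dir, categories), num_train_images_per_category)`, and likewise for the validation folder with `num_val_images_per_category`.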
