# Fruit Ripeness Level Classification
This notebook contains the data processing and model training for fruit ripeness level classification, Team Okra's hackathon solution.

# Datasets Download



## Secondary Datasets (Roboflow)

In [None]:
!pip install roboflow -qq

from roboflow import Roboflow
rf = Roboflow(api_key="***********")
project = rf.workspace("arm-oeppz").project("banana-8qkur")
version = project.version(2)
dataset = version.download("yolov8")


loading Roboflow workspace...
loading Roboflow project...


Downloading Dataset Version Zip in banana-2 to yolov8:: 100%|██████████| 73845/73845 [00:07<00:00, 10112.65it/s]





Extracting Dataset Version Zip to banana-2 in yolov8:: 100%|██████████| 3858/3858 [00:00<00:00, 7113.45it/s]


In [None]:
project = rf.workspace("srmist-doq3j").project("rot_detection")
version = project.version(13)
dataset = version.download("yolov8")

loading Roboflow workspace...
loading Roboflow project...


Downloading Dataset Version Zip in rot_detection-13 to yolov8:: 100%|██████████| 173631/173631 [00:13<00:00, 13140.00it/s]





Extracting Dataset Version Zip to rot_detection-13 in yolov8:: 100%|██████████| 7972/7972 [00:01<00:00, 6582.63it/s]


In [None]:
project = rf.workspace("mert6107").project("orange_detection-5f84p")
version = project.version(2)
dataset = version.download("yolov8")

loading Roboflow workspace...
loading Roboflow project...


Downloading Dataset Version Zip in orange_detection-2 to yolov8:: 100%|██████████| 21727/21727 [00:02<00:00, 8082.20it/s] 





Extracting Dataset Version Zip to orange_detection-2 in yolov8:: 100%|██████████| 1230/1230 [00:00<00:00, 9426.76it/s]


In [None]:
project = rf.workspace("money-detection-xez0r").project("tomato-checker")
version = project.version(1)
dataset = version.download("yolov8")

loading Roboflow workspace...
loading Roboflow project...


Downloading Dataset Version Zip in tomato-checker-1 to yolov8:: 100%|██████████| 235668/235668 [00:14<00:00, 16347.68it/s]





Extracting Dataset Version Zip to tomato-checker-1 in yolov8:: 100%|██████████| 12660/12660 [00:02<00:00, 5767.22it/s]


## Primary Dataset (Copy from Drive)

In [None]:
# Copy the Orange_Project dataset from Google Drive into /content
# (Drive must already be mounted at /content/drive)
!cp -r /content/drive/MyDrive/Orange_Project /content/

### Dataset Notes
Banana ripeness dataset: https://universe.roboflow.com/arm-oeppz/banana-8qkur/dataset/2
folder_name: banana-2
classes:
- ripe banana
- riper banana
- rotten banana
- unripe banana

Orange dataset: Google Drive
folder_name: Orange_Project
classes:
- Ripe
- Unripe

Orange dataset 2: https://universe.roboflow.com/mert6107/orange_detection-5f84p/dataset/
folder_name: orange_detection-2
classes:
- RottenOranges
- orange

Tomato dataset (intended use: unripe class only; the class mapping in the processing code also keeps "healthy ripe"):
folder_name: tomato-checker-1
classes:
- damaged
- healthy ripe
- unripe

Rot detection dataset: https://universe.roboflow.com/srmist-doq3j/rot_detection/dataset/13
folder_name: rot_detection-13
classes:
- cucumber_healthy
- cucumber_rotten
- eggplant_healthy
- eggplant_rotten
- grapes_healthy
- grapes_rotten
- spinach_healthy
- spinach_rotten
- tomato_healthy
- tomato_rotten

# Data Processing
This code combines multiple labeled fruit and vegetable image datasets into a single dataset for model training. The process involves:

1. Class Mapping: Maps dataset-specific class names (e.g., "ripe banana", "cucumber_rotten") to three standard labels: ripe, unripe, and rotten. Classes mapped to None (e.g., "damaged") are dropped.

2. Dataset Processing: Iterates over the source datasets, reads image-label pairs, skips images whose first annotation belongs to an unwanted class, and caps each class at 3,000 images across the source train and validation folders combined.

3. Label Conversion: Rewrites each label file to use the unified class IDs from the mapping.

4. Output Structure: Shuffles the selected images, re-splits each class 70:30, and copies the images and updated labels into a new combined_dataset directory with train and val folders, together with a dataset.yaml file.
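
Steps 1 and 3 can be illustrated on a single YOLO label line. The sketch below is a simplified illustration, not the notebook's own code: `remap_label_line` is a hypothetical helper, the mapping is restricted to the tomato-checker-1 classes, and the label format is assumed to be `class_id x_center y_center width height`. Unified IDs are assigned in sorted order of the target names, which matches the mapping the processing cell prints (0: ripe, 1: rotten, 2: unripe).

```python
from typing import List, Optional

# Subset of the notebook's mapping, limited to the tomato-checker-1 classes.
CLASS_MAPPING = {"damaged": None, "healthy ripe": "ripe", "unripe": "unripe"}

# Unified IDs follow the sorted order of the three target labels.
FINAL_CLASSES = sorted(["ripe", "unripe", "rotten"])
FINAL_CLASS_TO_ID = {name: idx for idx, name in enumerate(FINAL_CLASSES)}


def remap_label_line(line: str, original_class_names: List[str]) -> Optional[str]:
    """Rewrite one YOLO label line to the unified class ID.

    Returns None when the line is empty, the class ID is out of range,
    or the class is mapped to None (i.e. excluded from the combined dataset).
    """
    parts = line.split()
    if not parts:
        return None
    original_id = int(parts[0])
    if not 0 <= original_id < len(original_class_names):
        return None
    target = CLASS_MAPPING.get(original_class_names[original_id])
    if target is None:
        return None
    # Keep the box coordinates untouched; only the leading class ID changes.
    return " ".join([str(FINAL_CLASS_TO_ID[target]), *parts[1:]])


tomato_names = ["damaged", "healthy ripe", "unripe"]
print(remap_label_line("1 0.5 0.5 0.2 0.3", tomato_names))  # 'healthy ripe' -> ripe
print(remap_label_line("0 0.4 0.4 0.1 0.1", tomato_names))  # 'damaged' -> dropped
```

The first call yields a line starting with the unified ID for ripe, while the second returns `None` because the class is excluded.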

In [None]:
import os
import shutil
import random
from pathlib import Path
import yaml
import glob
import re

def create_directory(directory):
    """Create directory (and any missing parents) if it doesn't exist"""
    os.makedirs(directory, exist_ok=True)

def load_yaml(yaml_file):
    """Load YAML file"""
    with open(yaml_file, 'r') as f:
        return yaml.safe_load(f)

def save_yaml(data, yaml_file):
    """Save data to YAML file"""
    with open(yaml_file, 'w') as f:
        yaml.dump(data, f, default_flow_style=False)

def parse_class_mapping():
    """Map original dataset classes to our target classes"""
    class_mapping = {
        # Banana dataset mappings
        'ripe banana': 'ripe',
        'riper banana': 'ripe',  # Treat 'riper' as 'ripe'
        'rotten banana': 'rotten',
        'unripe banana': 'unripe',

        # Orange dataset mappings
        'Ripe': 'ripe',
        'Unripe': 'unripe',
        'RottenOranges': 'rotten',
        'orange': 'ripe',  #  general 'orange' is ripe

        # Tomato dataset mappings - 'damaged' is excluded; 'unripe' and 'healthy ripe' are kept
        'unripe': 'unripe',
        'healthy ripe': 'ripe',
        'damaged': None,  # Not using this class

        # Rot detection dataset mappings
        'cucumber_healthy': 'ripe',  #  'healthy' is 'ripe'
        'cucumber_rotten': 'rotten',
        'eggplant_healthy': 'ripe',
        'eggplant_rotten': 'rotten',
        'grapes_healthy': 'ripe',
        'grapes_rotten': 'rotten',
        'spinach_healthy': 'ripe',
        'spinach_rotten': 'rotten',
        'tomato_healthy': 'ripe',
        'tomato_rotten': 'rotten'
    }
    return class_mapping

def get_source_datasets():
    """Define and return paths to all source datasets with their class names"""
    source_datasets = [
        {
            'name': 'banana-2',
            'train_images': 'banana-2/train/images',
            'train_labels': 'banana-2/train/labels',
            'val_images': 'banana-2/valid/images',
            'val_labels': 'banana-2/valid/labels',
            'class_names': ['ripe banana', 'riper banana', 'rotten banana', 'unripe banana']
        },
        {
            'name': 'Orange_Project',
            'train_images': 'Orange_Project/images/train',
            'train_labels': 'Orange_Project/label/train',
            'val_images': 'Orange_Project/images/val',
            'val_labels': 'Orange_Project/label/val',
            'class_names': ['Ripe', 'Unripe']
        },
        {
            'name': 'orange_detection-2',
            'train_images': 'orange_detection-2/train/images',
            'train_labels': 'orange_detection-2/train/labels',
            'val_images': 'orange_detection-2/valid/images',
            'val_labels': 'orange_detection-2/valid/labels',
            'class_names': ['RottenOranges', 'orange']
        },
        {
            'name': 'tomato-checker-1',
            'train_images': 'tomato-checker-1/train/images',
            'train_labels': 'tomato-checker-1/train/labels',
            'val_images': 'tomato-checker-1/valid/images',
            'val_labels': 'tomato-checker-1/valid/labels',
            'class_names': ['damaged', 'healthy ripe', 'unripe']
        },
        {
            'name': 'rot_detection-13',
            'train_images': 'rot_detection-13/train/images',
            'train_labels': 'rot_detection-13/train/labels',
            'val_images': 'rot_detection-13/valid/images',
            'val_labels': 'rot_detection-13/valid/labels',
            'class_names': [
                'cucumber_healthy', 'cucumber_rotten',
                'eggplant_healthy', 'eggplant_rotten',
                'grapes_healthy', 'grapes_rotten',
                'spinach_healthy', 'spinach_rotten',
                'tomato_healthy', 'tomato_rotten'
            ]
        }
    ]
    return source_datasets

def update_label_class_ids(label_file, class_mapping_ids, original_class_names):
    """Update class IDs in the label files"""
    with open(label_file, 'r') as f:
        lines = f.readlines()

    updated_lines = []
    for line in lines:
        parts = line.strip().split()
        if not parts:
            continue

        original_class_id = int(parts[0])
        if original_class_id >= len(original_class_names):
            print(f"Warning: Class ID {original_class_id} out of range in {label_file}")
            continue

        original_class_name = original_class_names[original_class_id]

        if original_class_name in class_mapping_ids:
            new_class_id = class_mapping_ids[original_class_name]
            updated_line = f"{new_class_id} {' '.join(parts[1:])}\n"
            updated_lines.append(updated_line)

    with open(label_file, 'w') as f:
        f.writelines(updated_lines)

def process_dataset():
    """Main function to process and combine all datasets"""
    # Setup directories
    output_dir = "combined_dataset"
    train_images_dir = os.path.join(output_dir, "train", "images")
    train_labels_dir = os.path.join(output_dir, "train", "labels")
    val_images_dir = os.path.join(output_dir, "val", "images")
    val_labels_dir = os.path.join(output_dir, "val", "labels")

    # Create directories
    create_directory(train_images_dir)
    create_directory(train_labels_dir)
    create_directory(val_images_dir)
    create_directory(val_labels_dir)

    # Load class mapping
    class_mapping = parse_class_mapping()

    # Create a list of all final classes and assign IDs
    final_classes = sorted(set(v for v in class_mapping.values() if v is not None))
    final_class_to_id = {class_name: idx for idx, class_name in enumerate(final_classes)}

    # Count images per class to enforce maximum limit
    class_image_count = {class_name: 0 for class_name in final_classes}
    max_images_per_class = 3000

    # Gather all images and labels from source datasets
    all_images = []
    source_datasets = get_source_datasets()

    for dataset in source_datasets:
        try:
            # Get the original class names directly from the dataset info
            original_class_names = dataset['class_names']

            # Create a mapping from original class IDs to new class IDs
            class_mapping_ids = {}
            for idx, class_name in enumerate(original_class_names):
                if class_name in class_mapping and class_mapping[class_name] is not None:
                    target_class = class_mapping[class_name]
                    class_mapping_ids[class_name] = final_class_to_id[target_class]

            # Process training images and labels
            train_images = glob.glob(os.path.join(dataset['train_images'], "*.*"))
            for img_path in train_images:
                img_filename = os.path.basename(img_path)
                label_filename = os.path.splitext(img_filename)[0] + ".txt"
                label_path = os.path.join(dataset['train_labels'], label_filename)

                # Skip if label file doesn't exist
                if not os.path.exists(label_path):
                    continue

                # Determine the class from the label file
                with open(label_path, 'r') as f:
                    lines = f.readlines()

                if not lines:
                    continue

                line_parts = lines[0].strip().split()
                if not line_parts:
                    continue

                original_class_id = int(line_parts[0])
                if original_class_id >= len(original_class_names):
                    continue

                original_class_name = original_class_names[original_class_id]
                if original_class_name not in class_mapping or class_mapping[original_class_name] is None:
                    continue

                target_class = class_mapping[original_class_name]

                # Skip if we've reached the maximum for this class
                if class_image_count[target_class] >= max_images_per_class:
                    continue

                # Add image and corresponding label to our collection
                all_images.append({
                    'dataset': dataset['name'],
                    'img_path': img_path,
                    'label_path': label_path,
                    'class': target_class,
                    'original_class_names': original_class_names,
                    'split': 'train'
                })

                class_image_count[target_class] += 1

            # Process validation images and labels
            val_images = glob.glob(os.path.join(dataset['val_images'], "*.*"))
            for img_path in val_images:
                img_filename = os.path.basename(img_path)
                label_filename = os.path.splitext(img_filename)[0] + ".txt"
                label_path = os.path.join(dataset['val_labels'], label_filename)

                # Skip if label file doesn't exist
                if not os.path.exists(label_path):
                    continue

                # Determine the class from the label file
                with open(label_path, 'r') as f:
                    lines = f.readlines()

                if not lines:
                    continue

                line_parts = lines[0].strip().split()
                if not line_parts:
                    continue

                original_class_id = int(line_parts[0])
                if original_class_id >= len(original_class_names):
                    continue

                original_class_name = original_class_names[original_class_id]
                if original_class_name not in class_mapping or class_mapping[original_class_name] is None:
                    continue

                target_class = class_mapping[original_class_name]

                # Skip if we've reached the maximum for this class
                if class_image_count[target_class] >= max_images_per_class:
                    continue

                # Add image and corresponding label to our collection
                all_images.append({
                    'dataset': dataset['name'],
                    'img_path': img_path,
                    'label_path': label_path,
                    'class': target_class,
                    'original_class_names': original_class_names,
                    'split': 'val'
                })

                class_image_count[target_class] += 1

        except Exception as e:
            print(f"Error processing dataset {dataset['name']}: {e}")

    # Print class distribution before splitting
    print("Class distribution before re-splitting:")
    for class_name, count in class_image_count.items():
        print(f"  {class_name}: {count} images")

    # Shuffle all images for better distribution
    random.shuffle(all_images)

    # Create new 70:30 split
    class_train_count = {class_name: 0 for class_name in final_classes}
    class_val_count = {class_name: 0 for class_name in final_classes}

    # Fill each class's train split up to 70% of its total; the remainder goes to val
    for img_info in all_images:
        class_name = img_info['class']
        total_class_count = class_image_count[class_name]
        train_target = int(total_class_count * 0.7)  # 70% for training

        # Assign to train or val based on current counts
        if class_train_count[class_name] < train_target:
            img_info['new_split'] = 'train'
            class_train_count[class_name] += 1
        else:
            img_info['new_split'] = 'val'
            class_val_count[class_name] += 1

    # Print final split statistics
    print("\nFinal split statistics:")
    print("Train:")
    for class_name, count in class_train_count.items():
        print(f"  {class_name}: {count} images")

    print("\nValidation:")
    for class_name, count in class_val_count.items():
        print(f"  {class_name}: {count} images")

    # Copy images and labels to new directories
    for img_info in all_images:
        try:
            img_filename = os.path.basename(img_info['img_path'])
            dataset_prefix = re.sub(r'[^a-zA-Z0-9]', '_', img_info['dataset'])

            # Generate unique filenames by prefixing with dataset name
            unique_img_filename = f"{dataset_prefix}_{img_filename}"
            label_filename = os.path.splitext(unique_img_filename)[0] + ".txt"

            # Determine destination directories
            if img_info['new_split'] == 'train':
                dest_img_dir = train_images_dir
                dest_label_dir = train_labels_dir
            else:
                dest_img_dir = val_images_dir
                dest_label_dir = val_labels_dir

            # Copy image file
            shutil.copy2(img_info['img_path'], os.path.join(dest_img_dir, unique_img_filename))

            # Copy and update label file
            temp_label_path = os.path.join(dest_label_dir, label_filename)
            shutil.copy2(img_info['label_path'], temp_label_path)

            # Update class IDs in the label file
            class_mapping_ids = {}
            for idx, class_name in enumerate(img_info['original_class_names']):
                if class_name in class_mapping and class_mapping[class_name] is not None:
                    target_class = class_mapping[class_name]
                    class_mapping_ids[class_name] = final_class_to_id[target_class]

            update_label_class_ids(temp_label_path, class_mapping_ids, img_info['original_class_names'])

        except Exception as e:
            print(f"Error copying file: {e}")

    # Create dataset.yaml file
    dataset_yaml = {
        'path': os.path.abspath(output_dir),
        'train': 'train/images',
        'val': 'val/images',
        'names': {idx: name for name, idx in final_class_to_id.items()}
    }

    save_yaml(dataset_yaml, os.path.join(output_dir, "dataset.yaml"))
    print(f"\nCombined dataset created at: {os.path.abspath(output_dir)}")
    print(f"Dataset YAML file: {os.path.join(output_dir, 'dataset.yaml')}")
    print("\nClass ID mapping:")
    for class_name, class_id in final_class_to_id.items():
        print(f"  {class_id}: {class_name}")

if __name__ == "__main__":
    process_dataset()

Class distribution before re-splitting:
  ripe: 3000 images
  rotten: 2879 images
  unripe: 2105 images

Final split statistics:
Train:
  ripe: 2100 images
  rotten: 2015 images
  unripe: 1473 images

Validation:
  ripe: 900 images
  rotten: 864 images
  unripe: 632 images

Combined dataset created at: /content/combined_dataset
Dataset YAML file: combined_dataset/dataset.yaml

Class ID mapping:
  0: ripe
  1: rotten
  2: unripe
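
The train and validation counts above follow directly from the truncating 70% rule in the processing cell (`int(total * 0.7)` for train, the rest for validation). A minimal sketch of that arithmetic, where `split_counts` is a hypothetical helper introduced only for this check and the totals are copied from the printed class distribution:

```python
from typing import Tuple


def split_counts(total: int, train_fraction: float = 0.7) -> Tuple[int, int]:
    """Return (train, val) sizes; the train size is truncated, val takes the remainder."""
    train = int(total * train_fraction)
    return train, total - train


# Per-class totals as printed under "Class distribution before re-splitting".
totals = {"ripe": 3000, "rotten": 2879, "unripe": 2105}

for name, total in totals.items():
    train, val = split_counts(total)
    print(f"{name}: train={train}, val={val}")
```

Because the train size is truncated rather than rounded, any fractional image goes to validation, which is why rotten ends up at 2015/864 and unripe at 1473/632.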


# Model Training & Evaluation

## Setup



In [None]:
%pip install ultralytics
import ultralytics
ultralytics.checks()

Ultralytics 8.3.134 🚀 Python-3.11.12 torch-2.6.0+cu124 CUDA:0 (Tesla T4, 15095MiB)
Setup complete ✅ (2 CPUs, 12.7 GB RAM, 44.0/112.6 GB disk)


## Train model

In [None]:
# Train YOLO11s on combined dataset for 20 epochs
!yolo train model=yolo11s.pt data=/content/combined_dataset/dataset.yaml epochs=20 imgsz=640

Downloading https://github.com/ultralytics/assets/releases/download/v8.3.0/yolo11s.pt to 'yolo11s.pt'...
100% 18.4M/18.4M [00:00<00:00, 358MB/s]
Ultralytics 8.3.134 🚀 Python-3.11.12 torch-2.6.0+cu124 CUDA:0 (Tesla T4, 15095MiB)
[34m[1mengine/trainer: [0magnostic_nms=False, amp=True, augment=False, auto_augment=randaugment, batch=16, bgr=0.0, box=7.5, cache=False, cfg=None, classes=None, close_mosaic=10, cls=0.5, conf=None, copy_paste=0.0, copy_paste_mode=flip, cos_lr=False, cutmix=0.0, data=/content/combined_dataset/dataset.yaml, degrees=0.0, deterministic=True, device=None, dfl=1.5, dnn=False, dropout=0.0, dynamic=False, embed=None, epochs=20, erasing=0.4, exist_ok=False, fliplr=0.5, flipud=0.0, format=torchscript, fraction=1.0, freeze=None, half=False, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, imgsz=640, int8=False, iou=0.7, keras=False, kobj=1.0, line_width=None, lr0=0.01, lrf=0.01, mask_ratio=4, max_det=300, mixup=0.0, mode=train, model=yolo11s.pt, momentum=0.937, mosaic=1.0, multi_sca

# Save Model and Results

## Export Model in ONNX Format (for ease of deployment)

In [11]:
!yolo export model=/content/runs/detect/train/weights/best.pt format=onnx

Ultralytics 8.3.134 🚀 Python-3.11.12 torch-2.6.0+cu124 CPU (Intel Xeon 2.00GHz)
💡 ProTip: Export to OpenVINO format for best performance on Intel CPUs. Learn more at https://docs.ultralytics.com/integrations/openvino/
YOLO11s summary (fused): 100 layers, 9,413,961 parameters, 0 gradients, 21.3 GFLOPs

[34m[1mPyTorch:[0m starting from '/content/runs/detect/train/weights/best.pt' with input shape (1, 3, 640, 640) BCHW and output shape(s) (1, 7, 8400) (18.3 MB)
[31m[1mrequirements:[0m Ultralytics requirements ['onnx>=1.12.0,<1.18.0', 'onnxslim>=0.1.46', 'onnxruntime'] not found, attempting AutoUpdate...
Collecting onnx<1.18.0,>=1.12.0
  Downloading onnx-1.17.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (16 kB)
Collecting onnxslim>=0.1.46
  Downloading onnxslim-0.1.53-py3-none-any.whl.metadata (5.0 kB)
Collecting onnxruntime
  Downloading onnxruntime-1.22.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.5 kB)
Collecting coloredlogs (f
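
The exported output shape `(1, 7, 8400)` reported in the log can be read as 4 box coordinates plus 3 class scores per prediction, over 8400 candidate locations. The sketch below reproduces that number; it assumes the usual three detection scales with strides 8, 16, and 32 (an assumption about the detection head, not something the log states), and `yolo_detect_output_shape` is a hypothetical helper name.

```python
from typing import Sequence, Tuple


def yolo_detect_output_shape(
    imgsz: int, num_classes: int, strides: Sequence[int] = (8, 16, 32)
) -> Tuple[int, int, int]:
    """Expected (batch, channels, predictions) for a single square input image.

    Assumes one prediction per grid cell at each stride (assumed strides 8/16/32).
    """
    channels = 4 + num_classes  # x, y, w, h + one score per class
    predictions = sum((imgsz // s) ** 2 for s in strides)
    return (1, channels, predictions)


# 640x640 input, three classes (ripe, rotten, unripe)
print(yolo_detect_output_shape(640, 3))  # (1, 7, 8400)
```

When consuming the ONNX model, each of the 8400 columns therefore holds one box followed by three class scores, so downstream code needs its own confidence thresholding and non-maximum suppression unless these are handled elsewhere.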

## Download Model Training Results

In [None]:
import shutil
from google.colab import files

folder_path = '/content/runs/detect/train'

# Zip the training run folder; make_archive returns the path of the archive it creates
zip_path = shutil.make_archive("run_plots", 'zip', folder_path)

# Download the zip file to the local machine
files.download(zip_path)
