- To download the dataset, open your Kaggle account settings, go to the API section, and click Create New API Token. This downloads a file named kaggle.json, which contains your Kaggle username and key.

- In Colab, click the key icon in the left sidebar to open the Secrets panel. Click "Add new secret" twice: once with the name KAGGLE_USERNAME and once with the name KAGGLE_KEY, entering the matching value for each. Make sure to enable the Notebook access option for both secrets.

- The code below performs the following steps:

Step 0A: Setup: It installs the Augmentor library into your Colab environment. This gives us the tools needed for the main task.

Step 0B: Data Download: It securely uses your Kaggle credentials to download the entire PlantVillage dataset into your Colab session. It then defines two critical paths:

SOURCE_DIRECTORY: The location of the original, unbalanced grayscale images.

OUTPUT_DIRECTORY: The new, empty folder where your final, balanced dataset will be built.

Step 1: Analysis: The code acts like a census taker. It goes through every one of the 38 subfolders in the SOURCE_DIRECTORY, counts how many images are in each, and stores this information. It then finds the folder with the most images and sets that number as the TARGET_COUNT. This is the goal that all other folders must reach.
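
As a rough illustration of the census described in Step 1, the sketch below counts images per class folder and picks the largest class as the target. The helper names `count_images_per_class` and `find_target` and the `IMAGE_EXTENSIONS` tuple are hypothetical and not part of the notebook. Unlike the notebook code, which matches only `*.JPG`, this sketch assumes a case-insensitive match on a few common suffixes.

```python
import os

# Assumed set of image suffixes; the notebook itself only matches '*.JPG'.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def count_images_per_class(source_dir):
    """Return {class_folder_name: image_count} for each subfolder of source_dir."""
    counts = {}
    for entry in sorted(os.listdir(source_dir)):
        folder = os.path.join(source_dir, entry)
        if not os.path.isdir(folder):
            continue  # ignore stray files next to the class folders
        counts[entry] = sum(
            1 for name in os.listdir(folder)
            if name.lower().endswith(IMAGE_EXTENSIONS)
        )
    return counts


def find_target(counts):
    """Return (largest_class_name, its_image_count)."""
    largest = max(counts, key=counts.get)
    return largest, counts[largest]
```

Calling `find_target(count_images_per_class(SOURCE_DIRECTORY))` would then give the class name and the count that plays the role of TARGET_COUNT.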

Step 2: The Balancing Loop: This is the core of the operation. The code iterates through every single class (e.g., 'Tomato___Bacterial_spot', 'Potato___Early_blight', etc.). For each class, it performs the following sub-steps:

Creates a new home: It makes a corresponding subfolder in the OUTPUT_DIRECTORY.

Copies the originals: It copies all the original images from the source folder to this new folder. This ensures your final dataset contains all the original data.

Makes a decision: It checks whether the number of images for this class is less than the TARGET_COUNT.

Generates new images (if necessary): If the class is smaller than the target, it creates an Augmentor pipeline. This pipeline defines a set of random transformations (rotate, zoom, flip, change brightness/contrast). It then tells the pipeline to generate exactly the number of new images needed to reach the TARGET_COUNT. These new images are saved directly into the new output folder for that class.

Skips if not needed: If the class is already at the target size, it does nothing and moves to the next class.
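
The decision made in these sub-steps comes down to simple arithmetic, sketched below. The function name `plan_augmentation` and the class counts in the example are made up for illustration; they are not taken from the notebook or from the real dataset.

```python
def plan_augmentation(class_counts):
    """Map each class to the number of new images it needs to reach the largest class."""
    if not class_counts:
        return {}
    target = max(class_counts.values())
    # A class already at the target gets 0, which means "skip augmentation".
    return {name: target - count for name, count in class_counts.items()}


# Hypothetical counts, for illustration only.
example_counts = {"class_a": 500, "class_b": 120, "class_c": 500}
print(plan_augmentation(example_counts))
# {'class_a': 0, 'class_b': 380, 'class_c': 0}
```

In the notebook, the non-zero values correspond to the number passed to the Augmentor pipeline's `sample` call for that class.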

Step 3: Final Verification: After the main loop is finished, this last step runs. It goes through your new OUTPUT_DIRECTORY and counts the images in every subfolder one last time. It then prints a final report to confirm that every single class has been successfully balanced to the TARGET_COUNT.
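
One possible way to make the final check reusable is sketched below. The helper `find_unbalanced` is a hypothetical name, not part of the notebook; it assumes every regular file in a class folder is an image, which matches how the notebook's verification counts files with `os.listdir`.

```python
import os


def find_unbalanced(output_dir, target_count):
    """Return {class_name: file_count} for class folders whose count differs from target_count."""
    off_target = {}
    for entry in sorted(os.listdir(output_dir)):
        folder = os.path.join(output_dir, entry)
        if not os.path.isdir(folder):
            continue  # only class folders are checked
        count = sum(
            1 for name in os.listdir(folder)
            if os.path.isfile(os.path.join(folder, name))
        )
        if count != target_count:
            off_target[entry] = count
    return off_target
```

An empty result means every class folder holds exactly `target_count` files; otherwise the returned dictionary lists the classes to review.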

- Running this code might take 30-40 minutes.


In [None]:
!pip install kaggle kagglehub -q

import os
from google.colab import userdata
import kagglehub

# Set Kaggle API credentials from Colab secrets
try:
    os.environ["KAGGLE_USERNAME"] = userdata.get('KAGGLE_USERNAME')
    os.environ["KAGGLE_KEY"] = userdata.get('KAGGLE_KEY')
except Exception as e:
    # Catch Exception rather than using a bare except, and show the cause
    print(f"Kaggle credentials not found in Colab secrets. Please add them. ({e})")
    # You might want to stop the script here if credentials are required
    # raise ValueError("Kaggle credentials not found.")

print("Downloading dataset...")
# dataset_download returns the local path of the downloaded dataset folder
path = kagglehub.dataset_download("abdallahalidev/plantvillage-dataset")

print(f"Dataset is ready and mounted at the path: {path}")

# The dataset is already available as a folder, so no zipfile extraction step is needed.

# Define the path to the grayscale images used for this project.
# The next code blocks rely on this SOURCE_DIRECTORY variable.
SOURCE_DIRECTORY = os.path.join(path, 'plantvillage dataset', 'grayscale')

# --- Verification Step ---
# Let's check if this path is correct by listing its contents.
print("\nVerifying the source directory path...")
if os.path.exists(SOURCE_DIRECTORY):
    print(f"SUCCESS: The source directory is correctly set to:\n'{SOURCE_DIRECTORY}'")
    print("\nHere are some of the class folders inside:")
    # List the first 10 folders to confirm
    print(os.listdir(SOURCE_DIRECTORY)[:10])
else:
    print(f"ERROR: The directory was not found at '{SOURCE_DIRECTORY}'. Please check the path.")

In [None]:
# ==============================================================================
# THE DEFINITIVE AUGMENTOR WORKFLOW
# This single cell will handle everything: setup, download, analysis, and balancing.
# ==============================================================================

# ------------------------------------------------------------------------------
# STEP 0A: SETUP AND INSTALLATION
# ------------------------------------------------------------------------------
print("--> Step 0A: Installing Augmentor and required libraries...")
# The -q flag keeps pip's output quiet.
!pip install Augmentor kagglehub -q
print("--> Done.")

import Augmentor
import os
import sys  # needed for the sys.exit() calls below
import glob
import shutil
from collections import defaultdict
from google.colab import userdata

print("\n------------------------------------------------------------------------------\n")

# ------------------------------------------------------------------------------
# STEP 0B: DOWNLOAD DATA AND DEFINE PATHS
# ------------------------------------------------------------------------------
print("--> Step 0B: Downloading dataset and defining paths...")

# Set Kaggle API credentials
try:
    os.environ["KAGGLE_USERNAME"] = userdata.get('KAGGLE_USERNAME')
    os.environ["KAGGLE_KEY"] = userdata.get('KAGGLE_KEY')
except Exception as e:
    print(f"    - Warning: Could not set Kaggle credentials from secrets: {e}")

# Download the dataset
path = kagglehub.dataset_download("abdallahalidev/plantvillage-dataset")

# Define the source directory for our project (grayscale images)
SOURCE_DIRECTORY = os.path.join(path, 'plantvillage dataset', 'grayscale')
# Define where to save our new, balanced dataset
OUTPUT_DIRECTORY = '/content/plant_village_balanced_augmentor/'

if os.path.exists(SOURCE_DIRECTORY):
    print(f"--> SUCCESS: Source data is ready at '{SOURCE_DIRECTORY}'")
    print(f"--> The new balanced dataset will be created at '{OUTPUT_DIRECTORY}'")
else:
    print(f"--> FATAL ERROR: Source directory was not found. Halting execution.")
    # This stops the script if the data isn't there
    sys.exit()

print("\n------------------------------------------------------------------------------\n")

# ------------------------------------------------------------------------------
# STEP 1: ANALYZE CLASS IMBALANCE
# ------------------------------------------------------------------------------
print("--> Step 1: Analyzing class distribution...")

class_counts = defaultdict(int)
for class_folder in os.listdir(SOURCE_DIRECTORY):
    folder_path = os.path.join(SOURCE_DIRECTORY, class_folder)
    if os.path.isdir(folder_path):
        # Note: glob is case-sensitive on Linux, so only files ending in '.JPG' are counted
        num_images = len(glob.glob(os.path.join(folder_path, '*.JPG')))
        class_counts[class_folder] = num_images

if not class_counts:
    print("--> FATAL ERROR: No class folders found in source directory.")
    sys.exit()
else:
    sorted_classes = sorted(class_counts.items(), key=lambda item: item[1])
    TARGET_COUNT = sorted_classes[-1][1]
    print(f"--> Analysis complete. Target image count per class is: {TARGET_COUNT}")
    print(f"    (Based on the largest class: '{sorted_classes[-1][0]}')")

print("\n------------------------------------------------------------------------------\n")

# ------------------------------------------------------------------------------
# STEP 2: PERFORM BALANCING WITH AUGMENTOR
# ------------------------------------------------------------------------------
print("--> Step 2: Starting dataset balancing using Augmentor...")

# Create the main output directory
os.makedirs(OUTPUT_DIRECTORY, exist_ok=True)

# Loop through all the original class folders
for class_name, current_count in sorted_classes:
    print(f"\n--- Processing class: {class_name} ({current_count} images) ---")
    
    # Define paths for this specific class
    source_class_path = os.path.join(SOURCE_DIRECTORY, class_name)
    output_class_path = os.path.join(OUTPUT_DIRECTORY, class_name)
    os.makedirs(output_class_path, exist_ok=True)
    
    # --- 1. Copy original images to the new directory ---
    # Augmentor works best by creating a new, separate dataset
    original_image_paths = glob.glob(os.path.join(source_class_path, '*.JPG'))
    for img_path in original_image_paths:
        shutil.copy(img_path, output_class_path)
    print(f"    - Copied {len(original_image_paths)} original images.")

    # --- 2. Perform augmentation if needed ---
    if current_count < TARGET_COUNT:
        num_to_generate = TARGET_COUNT - current_count
        print(f"    - Augmentation needed. Generating {num_to_generate} new images...")
        
        # Create an Augmentor Pipeline that points to the *final* output directory
        p = Augmentor.Pipeline(output_class_path, output_directory=".", save_format="JPG")
        
        # --- Define the augmentation operations ---
        # These are added to the pipeline. Probabilities control how often they are applied.
        p.rotate(probability=0.7, max_left_rotation=20, max_right_rotation=20)
        p.zoom(probability=0.4, min_factor=0.9, max_factor=1.2)
        p.flip_left_right(probability=0.5)
        p.flip_top_bottom(probability=0.5)
        p.random_contrast(probability=0.6, min_factor=0.8, max_factor=1.2)
        p.random_brightness(probability=0.6, min_factor=0.8, max_factor=1.2)
        
        # --- Execute the pipeline ---
        # The 'sample' function generates the new images on disk.
        p.sample(num_to_generate, multi_threaded=False) # multi_threaded=False is more stable in Colab
        
        print(f"    - Augmentation complete for '{class_name}'.")
    else:
        print("    - Class is at target size. No augmentation needed.")

print("\n------------------------------------------------------------------------------\n")

# ------------------------------------------------------------------------------
# STEP 3: FINAL VERIFICATION
# ------------------------------------------------------------------------------
print("--> Step 3: Verifying the new balanced dataset...")

all_classes_balanced = True
for class_folder in os.listdir(OUTPUT_DIRECTORY):
    count = len(os.listdir(os.path.join(OUTPUT_DIRECTORY, class_folder)))
    print(f"    - {class_folder}: {count} images")
    if count != TARGET_COUNT:
        all_classes_balanced = False

print("\n--- VERIFICATION SUMMARY ---")
if all_classes_balanced:
    print(f"✅ SUCCESS! All classes have been balanced to {TARGET_COUNT} images.")
    print(f"Your balanced dataset is ready at: '{OUTPUT_DIRECTORY}'")
else:
    print("❌ NOTICE: Some classes are not at the target count. Please review the output.")