# Notebook 1: Preprocessing of Cutaneous Leishmaniasis Ulcer Images

## Purpose
This notebook preprocesses **cutaneous leishmaniasis (CL) ulcer images** for a binary classification task:
- **Class 0 (Sensitive):** CL ulcers showing healing / good treatment response
- **Class 1 (Poor):** CL ulcers showing poor treatment response

## Clinical Data Constraint
⚠️ **The dataset MUST contain ONLY cutaneous leishmaniasis ulcer wounds.**  
Do NOT include: non-disease wounds, traumatic cuts, burns, diabetic ulcers, pressure sores, healthy skin, or any non-CL skin lesions.

## Preprocessing Pipeline
1. Resize images to 224×224 pixels
2. Convert RGB → CIE LAB color space
3. Extract L (lightness) channel
4. Apply CLAHE (Contrast Limited Adaptive Histogram Equalization)
5. Apply median filtering for noise reduction
6. Normalize pixel values to [0, 1]

## Output
Preprocessed images are saved as 8-bit grayscale PNG files in `processed_data/sensitive/` and `processed_data/poor/`.

---
**⚠️ This notebook does NOT perform any model training or data splitting.**

## 1. Import Required Libraries

In [None]:
import os
import shutil
import zipfile
import numpy as np
import cv2
import matplotlib.pyplot as plt
from glob import glob

# Google Colab file upload utility
try:
    from google.colab import files
    IN_COLAB = True
except ImportError:
    IN_COLAB = False
    print("Not running in Google Colab. Manual upload will be skipped.")

print("All libraries imported successfully.")

## 2. Upload Dataset (Manual Upload)

### Instructions
1. Prepare your CL ulcer images in TWO folders: `sensitive/` and `poor/`
2. Place both folders inside a parent folder called `dataset/`
3. Compress `dataset/` into a **ZIP file** (e.g., `dataset.zip`)
4. Run the cell below; a file-upload dialog will appear
5. Select and upload your `dataset.zip`

### Expected ZIP Structure
```
dataset.zip
  └── dataset/
        ├── sensitive/    ← CL ulcers showing healing / good response
        │     ├── img001.jpg
        │     ├── img002.png
        │     └── ...
        └── poor/         ← CL ulcers showing poor treatment response
              ├── img001.jpg
              ├── img002.png
              └── ...
```
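
Before extracting, it can help to confirm that the archive really contains both class folders. The following is a minimal sketch using only the standard library; `class_folders_in_zip` is a hypothetical helper that is not part of this notebook, and the in-memory archive exists only for demonstration.

```python
import io
import zipfile
from pathlib import PurePosixPath

REQUIRED_CLASSES = ("sensitive", "poor")  # class folder names used in this notebook


def class_folders_in_zip(zip_source, required=REQUIRED_CLASSES):
    """Count the files stored directly inside each required class folder.

    Accepts 'dataset/sensitive/...' as well as 'sensitive/...' at the
    archive root, and ignores macOS '__MACOSX' resource entries.
    """
    counts = {name: 0 for name in required}
    with zipfile.ZipFile(zip_source) as archive:
        for entry in archive.namelist():
            if entry.endswith("/"):
                continue  # directory entry, not a file
            parts = PurePosixPath(entry).parts
            if "__MACOSX" in parts:
                continue
            # The class is the folder that directly contains the file.
            if len(parts) >= 2 and parts[-2] in counts:
                counts[parts[-2]] += 1
    return counts


# Build a tiny in-memory archive to demonstrate the check.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("dataset/sensitive/img001.jpg", b"fake")
    demo.writestr("dataset/sensitive/img002.png", b"fake")
    demo.writestr("dataset/poor/img001.jpg", b"fake")
    demo.writestr("__MACOSX/dataset/poor/._img001.jpg", b"junk")

demo_counts = class_folders_in_zip(buffer)
print(demo_counts)
```

A count of zero for either class suggests the ZIP was built from the wrong folder. With a real upload, pass the file path (for example `'dataset.zip'`) instead of the buffer.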

**⚠️ Clinical Reminder:**  
Ensure ALL images are **cutaneous leishmaniasis ulcer wounds ONLY**.  
Do NOT include non-CL wounds, healthy skin, burns, diabetic ulcers, or any other lesion type.

In [None]:
# ============================================================
# DATASET UPLOAD
# Upload your dataset as a ZIP file
# ============================================================

DATASET_DIR = 'dataset'
CLASSES = ['sensitive', 'poor']

if IN_COLAB:
    print("="*50)
    print("  STEP: Upload your dataset ZIP file")
    print("="*50)
    print("Please select your ZIP file containing:")
    print("  dataset/sensitive/  and  dataset/poor/\n")

    uploaded = files.upload()

    # Extract the uploaded zip file
    for filename in uploaded.keys():
        if filename.lower().endswith('.zip'):  # accept .ZIP as well as .zip
            print(f"\nExtracting '{filename}'...")
            with zipfile.ZipFile(filename, 'r') as zip_ref:
                zip_ref.extractall('.')
            print("Extraction complete.")
        else:
            print(f"⚠️ Warning: '{filename}' is not a ZIP file. Please upload a .zip file.")
else:
    print("Not in Colab. Ensure 'dataset/' folder exists in the working directory.")

# --------------------------------------------------
# AUTO-DETECT DATASET DIRECTORY
# Handles different ZIP structures:
#   Case 1: dataset/sensitive/ + dataset/poor/  (expected)
#   Case 2: sensitive/ + poor/ at root           (no parent)
#   Case 3: some_folder/sensitive/ + poor/       (different name)
# --------------------------------------------------

def find_dataset_dir(expected_name, required_subdirs):
    """
    Auto-detect the dataset directory after ZIP extraction.
    Returns the path to the directory containing the required subdirectories.
    """
    # Case 1: Expected directory exists with correct subdirs
    if os.path.isdir(expected_name):
        if all(os.path.isdir(os.path.join(expected_name, s)) for s in required_subdirs):
            return expected_name

    # Case 2: Subdirs exist directly in working directory
    if all(os.path.isdir(s) for s in required_subdirs):
        os.makedirs(expected_name, exist_ok=True)
        for s in required_subdirs:
            dest = os.path.join(expected_name, s)
            if not os.path.exists(dest):
                shutil.move(s, dest)
        print(f"  Detected subdirs at root level; moved into '{expected_name}/'")
        return expected_name

    # Case 3: Search for any directory containing both required subdirs
    for root, dirs, _ in os.walk('.'):
        # Skip hidden/system directories
        dirs[:] = [d for d in dirs if not d.startswith('.') and d != '__MACOSX']
        if all(s in dirs for s in required_subdirs):
            found_path = root
            if found_path != '.':
                print(f"  Auto-detected dataset directory: '{found_path}'")
                return found_path

    raise FileNotFoundError(
        f"Could not find a directory containing {required_subdirs}.\n"
        f"Please ensure your ZIP file contains a folder with "
        f"'{required_subdirs[0]}/' and '{required_subdirs[1]}/' subdirectories."
    )

DATASET_DIR = find_dataset_dir(DATASET_DIR, CLASSES)
print(f"\n✅ Using dataset directory: '{DATASET_DIR}/'")

## 3. Validate Dataset Structure

Verify folder structure, count images, and warn about any non-image files.
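
The filename rules can be tried on a plain list of names before touching the file system. This is a small illustrative sketch; `summarize_filenames` is a hypothetical helper, and the extension set is assumed to match the formats this notebook accepts.

```python
from collections import Counter
from pathlib import PurePath

# Assumed to mirror the image formats accepted by this notebook.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}


def summarize_filenames(filenames):
    """Split names into image and non-image groups and count image extensions."""
    images, others = [], []
    for name in filenames:
        suffix = PurePath(name).suffix.lower()
        # Hidden files (leading dot) are never treated as images.
        if not name.startswith(".") and suffix in IMAGE_EXTENSIONS:
            images.append(name)
        else:
            others.append(name)
    by_extension = Counter(PurePath(n).suffix.lower() for n in images)
    return images, others, by_extension


demo_names = ["img001.jpg", "IMG002.PNG", ".DS_Store", "notes.txt", "._img001.jpg"]
images, others, by_extension = summarize_filenames(demo_names)
print("images:", images)
print("skipped:", others)
print("per extension:", dict(by_extension))
```

The per-extension count is a quick way to spot an unexpected mix of formats in a class folder.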

In [None]:
# ============================================================
# DATASET VALIDATION
# ============================================================

VALID_EXTENSIONS = {'.jpg', '.jpeg', '.png', '.bmp', '.tif', '.tiff'}

# Files to ignore (OS-generated junk files)
IGNORE_FILES = {'.ds_store', 'thumbs.db', 'desktop.ini', '.gitkeep'}


def is_valid_image_file(filename):
    """Check if a filename is a valid image (not a hidden/system file)."""
    name_lower = filename.lower()
    # Skip hidden files, system files, and __MACOSX junk
    if filename.startswith('.') or name_lower in IGNORE_FILES:
        return False
    ext = os.path.splitext(filename)[1].lower()
    return ext in VALID_EXTENSIONS


def validate_dataset(dataset_dir, classes):
    """
    Validate the dataset directory structure and count images.

    Clinical Note: This checks file types only. It CANNOT verify
    clinical content. The user MUST ensure images are CL ulcers ONLY.
    """
    if not os.path.isdir(dataset_dir):
        raise FileNotFoundError(
            f"Dataset directory '{dataset_dir}' not found. "
            f"Please upload and extract the dataset first."
        )

    total_images = 0
    class_counts = {}

    for cls in classes:
        cls_path = os.path.join(dataset_dir, cls)
        if not os.path.isdir(cls_path):
            raise FileNotFoundError(
                f"Class directory '{cls_path}' not found.\n"
                f"Expected subdirectories: {classes}"
            )

        # Count valid image files
        image_files = [f for f in os.listdir(cls_path) if is_valid_image_file(f)]
        count = len(image_files)
        total_images += count
        class_counts[cls] = count

        # Warn about non-image files (excluding known junk)
        all_files = os.listdir(cls_path)
        skipped = [f for f in all_files if not is_valid_image_file(f) and f.lower() not in IGNORE_FILES and not f.startswith('.')]
        if skipped:
            print(f"  ⚠️  Non-image files in '{cls}/': {skipped[:5]}{'...' if len(skipped) > 5 else ''}")

        print(f"  Class '{cls}': {count} valid image(s)")

    if total_images == 0:
        raise ValueError(
            "No valid images found in the dataset!\n"
            "Supported formats: .jpg, .jpeg, .png, .bmp, .tif, .tiff"
        )

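    # Warn about empty classes and severe class imbalance
    counts = list(class_counts.values())
    empty = [cls for cls, n in class_counts.items() if n == 0]
    if empty:
        print(f"\n  ⚠️  WARNING: No valid images found for class(es): {empty}")
        print("     Binary classification needs images in both classes.")
    elif max(counts) / min(counts) > 5:
        print("\n  ⚠️  WARNING: Severe class imbalance detected!")
        print("     This may affect model training performance.")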

    print(f"\n  Total valid images: {total_images}")
    print("\n" + "="*50)
    print("⚠️  CLINICAL REMINDER:")
    print("  Ensure ALL images are cutaneous leishmaniasis")
    print("  ulcer wounds ONLY. Do NOT include:")
    print("  - Non-CL wounds, burns, diabetic ulcers")
    print("  - Pressure sores, healthy skin")
    print("  - Any non-leishmaniasis skin lesions")
    print("="*50)
    return True


print("Validating dataset structure...\n")
validate_dataset(DATASET_DIR, CLASSES)
print("\n✅ Dataset validation passed.")

## 4. Define Preprocessing Functions

### Preprocessing Pipeline (Medical Image Processing)

Each CL ulcer image undergoes the following steps:

| Step | Operation | Medical Rationale |
|------|-----------|-------------------|
| 1 | Resize to 224×224 | Standard deep learning input size |
| 2 | RGB → CIE LAB | Separates lightness from color information |
| 3 | Extract L channel | Keeps ulcer structure and texture while dropping the color (A, B) channels |
| 4 | CLAHE | Enhances ulcer borders and tissue texture contrast |
| 5 | Median filter | Removes noise while preserving ulcer edges |
| 6 | Normalize [0,1] | Required for stable neural network training |
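
To see why a median filter removes isolated noise without blurring an edge, here is a one-dimensional sketch in pure Python. It is an illustration only: `median_filter_1d` is a hypothetical helper, the pixel values are made up, and the notebook itself filters the 2-D image with OpenCV.

```python
from statistics import median


def median_filter_1d(values, kernel_size=5):
    """Median-filter a sequence, replicating the end values as padding."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    half = kernel_size // 2
    padded = [values[0]] * half + list(values) + [values[-1]] * half
    # Each output value is the median of the window centred on that position.
    return [median(padded[i:i + kernel_size]) for i in range(len(values))]


# A dark region (10) next to a bright region (200), with one noisy pixel (255).
row = [10, 10, 255, 10, 10, 10, 200, 200, 200, 200]
filtered = median_filter_1d(row)
print(filtered)
```

The outlier at index 2 is replaced by its neighbours' value, while the step from 10 to 200 stays at the same position. A mean filter of the same width would smear that step over several pixels.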

In [None]:
# ============================================================
# PREPROCESSING FUNCTIONS
# Medical image preprocessing pipeline for CL ulcer images
# ============================================================

# Target image dimensions (standard for deep learning models)
IMG_SIZE = (224, 224)

# CLAHE parameters
# clipLimit=2.0: controls contrast amplification (good default for medical images)
# tileGridSize=(8,8): divides image into 8x8 tiles for local equalization
CLAHE_CLIP_LIMIT = 2.0
CLAHE_TILE_SIZE = (8, 8)

# Median filter kernel size (must be odd; 5 balances noise removal + edge preservation)
MEDIAN_KERNEL_SIZE = 5


def preprocess_image(image_path):
    """
    Apply the full preprocessing pipeline to a single CL ulcer image.

    Pipeline:
      1. Load and resize to 224x224
      2. Convert RGB → CIE LAB color space
      3. Extract L (lightness) channel
      4. Apply CLAHE for contrast enhancement
      5. Apply median filtering for noise reduction
      6. Normalize pixel values to [0, 1]

    Args:
        image_path (str): Path to the input image file.

    Returns:
        tuple: (processed_image, original_rgb) or (None, None) on failure.
            processed_image: numpy.ndarray, shape (224,224), float32, [0,1]
            original_rgb: numpy.ndarray, shape (224,224,3), uint8, [0,255]
    """
    # Step 1: Load image (OpenCV loads as BGR)
    img_bgr = cv2.imread(image_path)

    if img_bgr is None:
        print(f"  ⚠️  Could not load image: {os.path.basename(image_path)}")
        print("     Skipping; verify it is a valid image file.")
        return None, None

    # Resize to 224x224
    img_bgr = cv2.resize(img_bgr, IMG_SIZE, interpolation=cv2.INTER_AREA)

    # Keep RGB copy for visualization
    original_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)

    # Step 2: Convert BGR → CIE LAB color space
    # LAB separates lightness (L) from color (A, B).
    # This is ideal for medical imaging where intensity patterns
    # carry clinical information about ulcer healing status.
    img_lab = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB)

    # Step 3: Extract L (lightness) channel
    # The L channel captures structural/textural detail of the ulcer
    # independent of color variation due to lighting or skin tone.
    l_channel = img_lab[:, :, 0]

    # Step 4: Apply CLAHE
    # Enhances local contrast, making ulcer borders and tissue
    # texture differences more visible for the classifier.
    clahe = cv2.createCLAHE(clipLimit=CLAHE_CLIP_LIMIT, tileGridSize=CLAHE_TILE_SIZE)
    l_clahe = clahe.apply(l_channel)

    # Step 5: Median filtering
    # Removes salt-and-pepper noise from camera artifacts
    # while preserving ulcer boundary edges.
    l_filtered = cv2.medianBlur(l_clahe, MEDIAN_KERNEL_SIZE)

    # Step 6: Normalize to [0, 1] range
    # Required for stable neural network training.
    processed = l_filtered.astype(np.float32) / 255.0

    return processed, original_rgb


def save_processed_image(image, save_path):
    """
    Save a preprocessed image (float32, [0,1]) as a PNG file.
    Converts back to uint8 [0,255] for saving.
    """
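    # Round before casting: astype(np.uint8) truncates, so a value that
    # lands just below an integer (e.g. 127.99999) would be stored as 127.
    img_uint8 = np.clip(np.rint(image * 255.0), 0, 255).astype(np.uint8)
    success = cv2.imwrite(save_path, img_uint8)
    if not success:
        print(f"  ⚠️  Failed to save: {save_path}")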
    return success


print("✅ Preprocessing functions defined.")

## 5. Run Preprocessing Pipeline

Process all CL ulcer images and save to `processed_data/`.
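
Because every output is written as `<name>.png`, two inputs in the same class folder that differ only in extension (for example `img001.jpg` and `img001.png`) would be written to the same output file. The sketch below checks for this ahead of time; `find_name_collisions` is a hypothetical helper and is not called anywhere in this notebook.

```python
import os
from collections import defaultdict


def find_name_collisions(filenames):
    """Group input names that would map to the same '<name>.png' output."""
    by_output = defaultdict(list)
    for name in filenames:
        output_name = os.path.splitext(name)[0] + ".png"
        by_output[output_name].append(name)
    # Keep only outputs that more than one input would write to.
    return {out: names for out, names in by_output.items() if len(names) > 1}


demo_files = ["img001.jpg", "img001.png", "img002.png"]
collisions = find_name_collisions(demo_files)
print(collisions)
```

An empty result means each input gets its own output file. If collisions are reported, renaming the affected inputs before running the pipeline is one simple way to avoid losing an image.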

In [None]:
# ============================================================
# RUN PREPROCESSING ON FULL DATASET
# ============================================================

OUTPUT_DIR = 'processed_data'

# Store first sample for visualization
sample_original = None
sample_processed = None
sample_name = None

total_success = 0
total_fail = 0

for cls in CLASSES:
    input_dir = os.path.join(DATASET_DIR, cls)
    output_dir = os.path.join(OUTPUT_DIR, cls)
    os.makedirs(output_dir, exist_ok=True)

    # Get valid image files (skip hidden/system files)
    image_files = sorted([f for f in os.listdir(input_dir) if is_valid_image_file(f)])

    print(f"\nProcessing class '{cls}': {len(image_files)} images...")

    success_count = 0
    fail_count = 0

    for i, fname in enumerate(image_files):
        input_path = os.path.join(input_dir, fname)

        # Apply preprocessing
        processed, original = preprocess_image(input_path)

        if processed is None:
            fail_count += 1
            continue

        # Save as PNG (count the image as failed if the write does not succeed)
        output_fname = os.path.splitext(fname)[0] + '.png'
        output_path = os.path.join(output_dir, output_fname)
        if not save_processed_image(processed, output_path):
            fail_count += 1
            continue
        success_count += 1

        # Store first sample for visualization
        if sample_original is None:
            sample_original = original
            sample_processed = processed
            sample_name = fname

        # Progress every 10 images
        if (i + 1) % 10 == 0 or (i + 1) == len(image_files):
            print(f"  [{i + 1}/{len(image_files)}] processed")

    total_success += success_count
    total_fail += fail_count
    print(f"  ✅ '{cls}': {success_count} succeeded, {fail_count} failed")

print(f"\n{'='*50}")
print(f"  PREPROCESSING COMPLETE")
print(f"{'='*50}")
print(f"  Total processed: {total_success}")
print(f"  Total failed:    {total_fail}")
print(f"  Output folder:   '{OUTPUT_DIR}/'")
for cls in CLASSES:
    out_dir = os.path.join(OUTPUT_DIR, cls)
    count = len([f for f in os.listdir(out_dir) if is_valid_image_file(f)])
    print(f"    {cls}: {count} images")

## 6. Visualization: Original vs Preprocessed

Display one CL ulcer image before and after the preprocessing pipeline.

In [None]:
# ============================================================
# VISUALIZATION: Side-by-side comparison
# ============================================================

if sample_original is not None and sample_processed is not None:
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Original RGB image
    axes[0].imshow(sample_original)
    axes[0].set_title(f'Original CL Ulcer Image\n({sample_name})', fontsize=12)
    axes[0].axis('off')

    # Preprocessed image (L channel, CLAHE, median, normalized)
    axes[1].imshow(sample_processed, cmap='gray', vmin=0, vmax=1)
    axes[1].set_title('Preprocessed\n(LAB-L + CLAHE + Median + Normalized)', fontsize=12)
    axes[1].axis('off')

    plt.suptitle('Cutaneous Leishmaniasis Ulcer: Preprocessing Result',
                 fontsize=14, fontweight='bold', y=1.02)
    plt.tight_layout()
    plt.show()

    print(f"Original shape:      {sample_original.shape}")
    print(f"Preprocessed shape:  {sample_processed.shape}")
    print(f"Pixel value range:   [{sample_processed.min():.4f}, {sample_processed.max():.4f}]")
else:
    print("⚠️  No images were successfully processed. Please check your dataset.")

## 7. Download Processed Data

Zip and download the `processed_data/` folder.  
You will need this file in **Notebook 2** (`model_training.ipynb`).
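
The layout inside the archive depends on the `root_dir` and `base_dir` arguments of `shutil.make_archive`. The sketch below builds a throwaway miniature of the output tree in a temporary directory (the file names and contents are placeholders) and lists the resulting entries, which shows the paths the next notebook should expect after extraction.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as workdir:
    # Placeholder output tree: one dummy file per class.
    for cls in ("sensitive", "poor"):
        cls_dir = os.path.join(workdir, "processed_data", cls)
        os.makedirs(cls_dir)
        with open(os.path.join(cls_dir, "img001.png"), "wb") as fh:
            fh.write(b"placeholder")

    # root_dir is where archiving starts; base_dir is the folder stored in the ZIP.
    archive_path = shutil.make_archive(
        os.path.join(workdir, "processed_data"),
        "zip",
        root_dir=workdir,
        base_dir="processed_data",
    )

    with zipfile.ZipFile(archive_path) as archive:
        # Skip directory entries and keep only the stored files.
        file_entries = sorted(n for n in archive.namelist() if not n.endswith("/"))

print(file_entries)
```

Every entry keeps the `processed_data/` prefix, so extracting the archive recreates the folder with both class subfolders inside it.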

In [None]:
# ============================================================
# ZIP AND DOWNLOAD processed_data/
# ============================================================

ZIP_NAME = 'processed_data'

# Create zip
shutil.make_archive(ZIP_NAME, 'zip', '.', OUTPUT_DIR)  # archive the folder written in Section 5
zip_path = ZIP_NAME + '.zip'
zip_size_mb = os.path.getsize(zip_path) / (1024 * 1024)
print(f"Created: {zip_path} ({zip_size_mb:.2f} MB)")

# Download in Colab
if IN_COLAB:
    files.download(zip_path)
    print("\n📥 Download started.")
else:
    print(f"\nFile saved as '{zip_path}' in the working directory.")

print("\n" + "="*50)
print("  NOTEBOOK 1 COMPLETE")
print("="*50)
print("\nNext step:")
print("  1. Open model_training.ipynb")
print("  2. Upload processed_data.zip when prompted")