# ⚙️ Pre-Processing

> See [README-file](../README.md) for more information on how to set up the project.

This notebook prepares the dataset for training and evaluation by restructuring it into train/validation/test splits and applying masks to the input images.

## Dataset Structure

The original dataset is grouped by class only, which is inconvenient for training machine learning models. The new structure has one top-level folder each for images, masks, and masked images; each of these contains `train`, `val`, and `test` subfolders with one directory per class. The train/val/test split ratio is `72/18/10`, stratified by class.
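
A stratified split with these ratios can be sketched as follows. This is a minimal illustration and not the project's `restructure_dataset` implementation: the helper name `stratified_split`, the default seed, and the rounding rule (train and validation sizes rounded down, remainder assigned to test) are all assumptions.

```python
import random


def stratified_split(files_by_class, ratios=(0.72, 0.18, 0.10), seed=42):
    """Split each class separately so every split keeps the class proportions.

    Hypothetical helper: train and val sizes are rounded down and the
    remainder goes to test (an assumed rounding rule).
    """
    train_ratio, val_ratio, _ = ratios
    rng = random.Random(seed)
    splits = {"train": {}, "val": {}, "test": {}}

    for cls, files in files_by_class.items():
        shuffled = sorted(files)  # sort first so the shuffle is reproducible
        rng.shuffle(shuffled)

        n_train = int(len(shuffled) * train_ratio)
        n_val = int(len(shuffled) * val_ratio)

        splits["train"][cls] = shuffled[:n_train]
        splits["val"][cls] = shuffled[n_train:n_train + n_val]
        splits["test"][cls] = shuffled[n_train + n_val:]

    return splits


# Toy example with made-up file names
toy = {
    "COVID": [f"COVID-{i}.png" for i in range(101)],
    "Normal": [f"Normal-{i}.png" for i in range(53)],
}
result = stratified_split(toy)
for split, per_class in result.items():
    print(split, {cls: len(files) for cls, files in per_class.items()})
```

Because each class is shuffled and sliced on its own, a rare class such as `Viral Pneumonia` appears in every split in roughly the same proportion as in the full dataset.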

```
# Original structure
- data/
    - COVID/
        - images/
        - masks/
    - ...

# New structure
- data_split/
    - images/
        - train/
            - COVID/
            - Lung_Opacity/
            - Normal/
            - Viral Pneumonia/
        - val/
            - ...
        - test/
            - ...
    - masks/
        - train/
            - ...
        - val/
            - ...
        - test/
            - ...
    - masked_images/
        - train/
            - ...
        - val/
            - ...
        - test/
            - ...
```
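
The new layout is the Cartesian product of category, split, and class. The sketch below enumerates the leaf directories with `pathlib`; the constant names and the `target_directories` helper are illustrative and not part of the project code.

```python
from pathlib import Path

CATEGORIES = ("images", "masks", "masked_images")
SPLITS = ("train", "val", "test")
CLASSES = ("COVID", "Lung_Opacity", "Normal", "Viral Pneumonia")


def target_directories(root):
    """Return every leaf directory of the new structure under `root`."""
    return [
        Path(root) / category / split / cls
        for category in CATEGORIES
        for split in SPLITS
        for cls in CLASSES
    ]


dirs = target_directories("data_split")
print(len(dirs))            # 3 categories x 3 splits x 4 classes = 36
print(dirs[0].as_posix())   # data_split/images/train/COVID
```

Calling `directory.mkdir(parents=True, exist_ok=True)` on each entry would create the empty tree before any files are copied.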

## Processing Steps

1. **Environment Detection**: Automatically detect whether the notebook is running in Google Colab or in a local environment.
2. **Dataset Restructuring**: Split the original dataset into train/validation/test sets using stratified sampling.
3. **Mask Application**: Apply the masks to the images to create masked images for model training.
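
For step 1, one common way to detect Colab is to check whether the `google.colab` package can be found. The sketch below shows that idea only; it is an assumption about how such a check might look, not the code behind `setup_environment`, and `running_in_colab` is a hypothetical name.

```python
import importlib.util


def running_in_colab():
    """Return True if the `google.colab` package can be located."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except (ImportError, ValueError):
        # ImportError covers the case where the parent package `google`
        # is not installed at all.
        return False


in_colab = running_in_colab()
print("Google Colab" if in_colab else "Local")
```

A check like this avoids importing Colab-specific modules on a local machine, where they are usually not installed.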

In [21]:
# Automatically reload modules before executing code.
%reload_ext autoreload
%autoreload 2

import logging
import sys
from os import path

# Add src directory to path for imports
sys.path.append(path.join("..", "src"))

from src.util.environment import setup_environment, get_dataset_config
from src.util.dataset import restructure_dataset
from src.util.image_processing import create_masked_images

# Setup logging
logging.basicConfig(level=logging.INFO, format="%(levelname)s - %(message)s")

print("🚀 Starting pre-processing setup...")

🚀 Starting pre-processing setup...


In [22]:
# 🌍 Environment Detection and Setup
print("🔍 Detecting environment and setting up paths...")

# Setup environment (detects Colab vs local, mounts drive if needed, loads config)
is_google_colab, config, paths = setup_environment()

# Get dataset configuration
dataset_config = get_dataset_config()

print(f"📍 Environment: {'Google Colab' if is_google_colab else 'Local'}")
print(f"📂 Data path: {paths['data_path']}")
print(f"📂 Split data path: {paths['split_data_path']}")
print(f"🏷️  Dataset classes: {dataset_config['classes']}")
print(f"📊 Split ratio: {dataset_config['split_ratio']}")

# Check if source data exists
import os

if os.path.exists(paths["data_path"]):
    print("✅ Source data directory found")
else:
    print(f"❌ Source data directory not found: {paths['data_path']}")
    print(
        "Please ensure the dataset is downloaded and placed in the correct"
        " location."
    )

INFO - Not running in Google Colab.
INFO - Environment setup complete. Running in local environment
INFO - Paths: {'data_path': '/Users/fabianhofmann/⌨️_development/git_projects/covid_classification/data', 'split_data_path': '/Users/fabianhofmann/⌨️_development/git_projects/covid_classification/data_split', 'models_path': '/Users/fabianhofmann/⌨️_development/git_projects/covid_classification/models', 'image_path': '/Users/fabianhofmann/⌨️_development/git_projects/covid_classification/data_split/images', 'test_path': '/Users/fabianhofmann/⌨️_development/git_projects/covid_classification/data_split/images/test'}

🔍 Detecting environment and setting up paths...
📍 Environment: Local
📂 Data path: /Users/fabianhofmann/⌨️_development/git_projects/covid_classification/data
📂 Split data path: /Users/fabianhofmann/⌨️_development/git_projects/covid_classification/data_split
🏷️  Dataset classes: ['COVID', 'Lung_Opacity', 'Normal', 'Viral Pneumonia']
📊 Split ratio: (0.72, 0.18, 0.1)
✅ Source data directory found


In [16]:
# 🔄 Dataset Restructuring
print("🔄 Starting dataset restructuring...")

try:
    restructure_dataset(
        source_dir=paths["data_path"],
        target_dir=paths["split_data_path"],
        dataset_classes=dataset_config["classes"],
        dataset_categories=dataset_config["categories"],
        split_ratio=dataset_config["split_ratio"],
        random_seed=dataset_config["random_seed"],
    )
    print("✅ Dataset restructuring completed successfully!")
except Exception as e:
    print(f"❌ Error during dataset restructuring: {e}")

INFO - > Restructuring dataset from /Users/fabianhofmann/⌨️_development/git_projects/covid_classification/data to /Users/fabianhofmann/⌨️_development/git_projects/covid_classification/data_split with split ratio (0.72, 0.18, 0.1).
INFO - 	- Removing existing target directory: /Users/fabianhofmann/⌨️_development/git_projects/covid_classification/data_split


🔄 Starting dataset restructuring...


INFO - 	- Class COVID: 2603 train, 650 val, 363 test files
INFO - 	- Class Lung_Opacity: 4328 train, 1082 val, 602 test files
INFO - 	- Class Normal: 7338 train, 1834 val, 1020 test files
INFO - 	- Class Viral Pneumonia: 968 train, 242 val, 135 test files
INFO - 	- Dataset restructuring completed. Files copied to /Users/fabianhofmann/⌨️_development/git_projects/covid_classification/data_split.


✅ Dataset restructuring completed successfully!
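
The logged counts can be checked against the configured `(0.72, 0.18, 0.1)` ratio. The short check below uses only numbers copied from the log output above; the variable names are illustrative. The counts turn out to be consistent with rounding the train and validation sizes down and assigning the remainder to test, although the log alone does not confirm that `restructure_dataset` works this way.

```python
# Per-class (train, val, test) counts copied from the log output
counts = {
    "COVID": (2603, 650, 363),
    "Lung_Opacity": (4328, 1082, 602),
    "Normal": (7338, 1834, 1020),
    "Viral Pneumonia": (968, 242, 135),
}

floor_rule_holds = {}
for cls, (train, val, test) in counts.items():
    total = train + val + test
    shares = [round(n / total, 3) for n in (train, val, test)]
    # Assumed rule: floor for train and val, remainder for test
    floor_rule_holds[cls] = (
        train == int(total * 0.72) and val == int(total * 0.18)
    )
    print(f"{cls}: total={total}, shares={shares}")

# Column sums across all classes
totals = [sum(c[i] for c in counts.values()) for i in range(3)]
print(totals)  # [15237, 3808, 2120]
```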


In [19]:
# 🎭 Creating Masked Images
print("🎭 Creating masked images...")

try:
    create_masked_images(
        split_data_path=paths["split_data_path"],
        dataset_classes=dataset_config["classes"],
    )
    print("✅ Masked images creation completed successfully!")
except Exception as e:
    print(f"❌ Error during masked images creation: {e}")

INFO - Creating masked images in /Users/fabianhofmann/⌨️_development/git_projects/covid_classification/data_split


🎭 Creating masked images...


INFO - Processed 2603 images for class COVID, split train
INFO - Processed 4328 images for class Lung_Opacity, split train
INFO - Processed 7338 images for class Normal, split train
INFO - Processed 968 images for class Viral Pneumonia, split train
INFO - Processed 650 images for class COVID, split val
INFO - Processed 1082 images for class Lung_Opacity, split val
INFO - Processed 1834 images for class Normal, split val
INFO - Processed 242 images for class Viral Pneumonia, split val
INFO - Processed 363 images for class COVID, split test

✅ Masked images creation completed successfully!
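
Applying a binary mask usually means keeping the pixels inside the mask and setting everything else to zero. The NumPy sketch below illustrates that operation on a tiny array. It is an assumption about the general technique, and `create_masked_images` may work differently (for example, it may resize the image or mask first); `apply_mask` is a hypothetical helper.

```python
import numpy as np


def apply_mask(image, mask):
    """Zero out every pixel of `image` where `mask` is 0.

    Assumes a grayscale image and a mask of the same height and width.
    """
    if image.shape != mask.shape:
        raise ValueError(f"Shape mismatch: {image.shape} vs {mask.shape}")
    return np.where(mask > 0, image, 0).astype(image.dtype)


image = np.array([[10, 20, 30],
                  [40, 50, 60]], dtype=np.uint8)
mask = np.array([[0, 255, 255],
                 [0, 0, 255]], dtype=np.uint8)

masked = apply_mask(image, mask)
print(masked.tolist())  # [[0, 20, 30], [0, 0, 60]]
```

If image and mask sizes differ, one of them has to be resized before the mask is applied; the sketch raises an error instead of guessing.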


In [20]:
# 📊 Dataset Summary and Verification
print("📊 Generating dataset summary...")

import os
from collections import defaultdict


def count_files_in_directory(directory):
    """Count image files per category, split, and class under `directory`."""
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    if not os.path.exists(directory):
        print(f"Directory {directory} does not exist")
        return counts

    for category in ["images", "masks", "masked_images"]:
        category_path = os.path.join(directory, category)
        if os.path.exists(category_path):
            for split in ["train", "val", "test"]:
                split_path = os.path.join(category_path, split)
                if os.path.exists(split_path):
                    for cls in dataset_config["classes"]:
                        cls_path = os.path.join(split_path, cls)
                        if os.path.exists(cls_path):
                            file_count = len([
                                f
                                for f in os.listdir(cls_path)
                                if f.lower().endswith((".png", ".jpg", ".jpeg"))
                            ])
                            counts[category][split][cls] = file_count

    return counts


# Count files in the processed dataset
if os.path.exists(paths["split_data_path"]):
    counts = count_files_in_directory(paths["split_data_path"])

    print("\n📈 Dataset Statistics:")
    print("=" * 60)

    for category in ["images", "masks", "masked_images"]:
        if category in counts:
            print(f"\n{category.upper()}:")
            print("-" * 30)
            for split in ["train", "val", "test"]:
                if split in counts[category]:
                    total = sum(counts[category][split].values())
                    print(f"{split.capitalize()}: {total} files")
                    for cls in dataset_config["classes"]:
                        count = counts[category][split].get(cls, 0)
                        print(f"  - {cls}: {count}")

    print("\n✅ Pre-processing pipeline completed successfully!")
    print(f"📂 Processed dataset available at: {paths['split_data_path']}")
else:
    print(f"❌ Split data directory not found: {paths['split_data_path']}")

📊 Generating dataset summary...

📈 Dataset Statistics:

IMAGES:
------------------------------
Train: 15237 files
  - COVID: 2603
  - Lung_Opacity: 4328
  - Normal: 7338
  - Viral Pneumonia: 968
Val: 3808 files
  - COVID: 650
  - Lung_Opacity: 1082
  - Normal: 1834
  - Viral Pneumonia: 242
Test: 2120 files
  - COVID: 363
  - Lung_Opacity: 602
  - Normal: 1020
  - Viral Pneumonia: 135

MASKS:
------------------------------
Train: 15237 files
  - COVID: 2603
  - Lung_Opacity: 4328
  - Normal: 7338
  - Viral Pneumonia: 968
Val: 3808 files
  - COVID: 650
  - Lung_Opacity: 1082
  - Normal: 1834
  - Viral Pneumonia: 242
Test: 2120 files
  - COVID: 363
  - Lung_Opacity: 602
  - Normal: 1020
  - Viral Pneumonia: 135

MASKED_IMAGES:
------------------------------
Train: 15237 files
  - COVID: 2603
  - Lung_Opacity: 4328
  - Normal: 7338
  - Viral Pneumonia: 968
Val: 3808 files
  - COVID: 650
  - Lung_Opacity: 1082
  - Normal: 1834
  - Viral Pneumonia: 242
Test: 2120 files
  - COVID: 363
  - Lun