# Dataset Processing

This notebook combines the logic of the scripts `cambio.py` and `convert_masks.py`.

**Goals:**
1.  **Preprocessing (cambio.py):** Generate the base YOLO dataset (images and segmentation labels) from the raw dataset.
2.  **Conversion (convert_masks.py):** Generate bounding-box labels from the segmentation labels (optional, but useful for detection).

## 1. Configuration and Imports

In [1]:
import os
import pandas as pd
from PIL import Image
import numpy as np
import random
import cv2

# =============================
# GLOBAL CONFIGURATION
# =============================
DATASET_ROOT = "./dataset(acuatico)"
OUTPUT_ROOT = "./dataset_yolo"
TARGET_SIZE = (640, 640)  # (width, height)

# Number of corrosion-free images to keep, as a fraction of the positive count (to reduce false positives)
NEGATIVES_RATIO = 0.25
# Class of interest
TARGET_CLASS = "corrosion"

# Derived paths
IMAGES_DIR = os.path.join(DATASET_ROOT, "images")
MASKS_DIR = os.path.join(DATASET_ROOT, "masks")
SPLIT_CSV = os.path.join(DATASET_ROOT, "train_test_split.csv")

# YOLO output paths
YOLO_IMAGES_TRAIN = os.path.join(OUTPUT_ROOT, "images", "train")
YOLO_IMAGES_TEST  = os.path.join(OUTPUT_ROOT, "images", "test")
YOLO_LABELS_TRAIN = os.path.join(OUTPUT_ROOT, "labels", "train")
YOLO_LABELS_TEST  = os.path.join(OUTPUT_ROOT, "labels", "test")

# Bounding-box output paths (defined for the conversion step; the in-place conversion in section 3 does not use them)
YOLO_LABELS_BBOX_TRAIN = os.path.join(OUTPUT_ROOT, "labels_bbox", "train")
YOLO_LABELS_BBOX_TEST = os.path.join(OUTPUT_ROOT, "labels_bbox", "test")

## 2. Generating the Base Dataset (Logic from `cambio.py`)

This section processes the images and masks, resizes them to `TARGET_SIZE`, and generates segmentation labels.
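
As a rough illustration of the label format produced in this section, the sketch below normalizes one contour and formats it as a YOLO segmentation line. The `(N, 1, 2)` array shape mirrors what `cv2.findContours` returns; the helper name `contour_to_yolo_line` and the sample square are hypothetical, chosen only for this example.

```python
import numpy as np

def contour_to_yolo_line(contour, size, class_id=0):
    """Format one (N, 1, 2) pixel-coordinate contour as 'class x1 y1 ... xn yn'."""
    width, height = size
    points = contour.reshape(-1, 2).astype(float)
    points[:, 0] /= width    # x coordinates normalized to [0, 1]
    points[:, 1] /= height   # y coordinates normalized to [0, 1]
    coords = " ".join(f"{value:.6f}" for value in points.ravel())
    return f"{class_id} {coords}"

# A 320x320 square in the top-left corner of a 640x640 image (illustrative values)
square = np.array([[[0, 0]], [[320, 0]], [[320, 320]], [[0, 320]]], dtype=np.int32)
line = contour_to_yolo_line(square, (640, 640))
print(line)
# 0 0.000000 0.000000 0.500000 0.000000 0.500000 0.500000 0.000000 0.500000
```

Because the coordinates are divided by the resized width and height, the labels stay valid even though the resize to a square does not preserve the original aspect ratio.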

In [2]:
# Create output directories
for folder in [YOLO_IMAGES_TRAIN, YOLO_IMAGES_TEST, YOLO_LABELS_TRAIN, YOLO_LABELS_TEST]:
    os.makedirs(folder, exist_ok=True)

# Write dataset.yml (validation reuses the test split)
dataset_yml_content = """
train: ./images/train
val: ./images/test
test: ./images/test

nc: 1
names: ['corrosion']
"""
with open(os.path.join(OUTPUT_ROOT, 'dataset.yml'), 'w') as f:
    f.write(dataset_yml_content)
print(f"✅ Created {os.path.join(OUTPUT_ROOT, 'dataset.yml')}")

# =============================
# IDENTIFY TRUE POSITIVE AND NEGATIVE IMAGES
# =============================
print("🔍 Identifying true positive and negative images (checking mask contents)...")

true_positive_basenames = set()
true_negative_info = {}  # Maps basename -> full filename for negatives

target_class_path = os.path.join(MASKS_DIR, TARGET_CLASS)
all_image_files_in_dir = os.listdir(IMAGES_DIR)
all_images_info = {os.path.splitext(f)[0]: f for f in all_image_files_in_dir}

if not os.path.isdir(target_class_path):
    print(f"❌ Error: the mask directory for '{TARGET_CLASS}' does not exist: {target_class_path}")
    # Without a mask directory, every image is effectively negative
    true_negative_info = all_images_info.copy()
else:
    # List the mask directory once instead of once per image.
    # setdefault keeps the first file listed for each basename.
    mask_files_by_basename = {}
    for f in os.listdir(target_class_path):
        mask_files_by_basename.setdefault(os.path.splitext(f)[0], f)

    for image_file in all_image_files_in_dir:
        base_name = os.path.splitext(image_file)[0]
        mask_file = mask_files_by_basename.get(base_name)

        if mask_file:
            mask_file_path = os.path.join(target_class_path, mask_file)
            try:
                with Image.open(mask_file_path) as mask_img:
                    mask_array = np.array(mask_img.convert("L"))  # Convert to grayscale
                    if not mask_array.any():  # Every pixel is black
                        true_negative_info[base_name] = image_file
                    else:
                        true_positive_basenames.add(base_name)
            except Exception as e:
                print(f"⚠️  Warning: could not process mask {mask_file_path}: {e}. Treating it as negative.")
                true_negative_info[base_name] = image_file
        else:
            # An image with no mask file at all is also a true negative
            true_negative_info[base_name] = image_file

print(f"  - Found {len(true_positive_basenames)} true positive images.")
print(f"  - Found {len(true_negative_info)} true negative images.")

✅ Created ./dataset_yolo\dataset.yml
🔍 Identifying true positive and negative images (checking mask contents)...
  - Found 209 true positive images.
  - Found 1684 true negative images.
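
The positive/negative rule in this cell depends only on whether a mask contains any non-zero pixel. A minimal sketch on synthetic arrays (the function name `is_positive_mask` is hypothetical and not part of the original scripts):

```python
import numpy as np

def is_positive_mask(mask_array):
    """Return True when at least one pixel of the grayscale mask is non-zero."""
    return bool(np.any(mask_array))

empty = np.zeros((4, 4), dtype=np.uint8)   # all-black mask: no corrosion
marked = empty.copy()
marked[1, 2] = 255                         # a single labelled pixel

results = {"empty": is_positive_mask(empty), "marked": is_positive_mask(marked)}
print(results)
# {'empty': False, 'marked': True}
```

For non-negative pixel values this gives the same answer as comparing the pixel sum against zero.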


In [3]:
# =============================
# READ SPLIT AND BUILD LOOKUP
# =============================
split_df = pd.read_csv(SPLIT_CSV)
# Build a dictionary for fast split lookups: {basename: split}
split_lookup = {os.path.splitext(row['file_name'])[0]: row['split'].lower() for _, row in split_df.iterrows()}

positives_count = 0
negatives_count = 0
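
The lookup is keyed by basename and lower-cases the split value; when positives are processed, anything that does not resolve to `train` or `test` falls back to `train`. A small sketch of that behavior with made-up rows (the file names and the helper `resolve_split` are hypothetical):

```python
import os

# Made-up (file_name, split) rows standing in for the CSV contents
rows = [("img_001.jpg", "Train"), ("img_002.png", "TEST"), ("img_003.jpg", "val")]
split_lookup = {os.path.splitext(name)[0]: split.lower() for name, split in rows}

def resolve_split(base_name, lookup):
    """Return 'train' or 'test', defaulting to 'train' for missing or unexpected values."""
    split = lookup.get(base_name)
    return split if split in ("train", "test") else "train"

resolved = [resolve_split(name, split_lookup)
            for name in ["img_001", "img_002", "img_003", "img_999"]]
print(resolved)
# ['train', 'test', 'train', 'train']
```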

In [4]:
# =============================
# PROCESS ALL POSITIVE IMAGES
# =============================
print(f"⚙️  Processing the {len(true_positive_basenames)} positive images found...")

for base_name in true_positive_basenames:
    # 1. Determine the split (train/test)
    split = split_lookup.get(base_name)

    # 2. If it is missing from the CSV (or has any other value), assign it to 'train'
    if split not in ["train", "test"]:
        split = "train"

    # 3. Get the original filename
    image_name = all_images_info.get(base_name)
    if not image_name:
        print(f"⚠️  Warning: no filename found for basename {base_name}. Skipping.")
        continue

    positives_count += 1

    # Process the image
    img_path = os.path.join(IMAGES_DIR, image_name)
    img_out_dir = YOLO_IMAGES_TRAIN if split == "train" else YOLO_IMAGES_TEST
    img_out_path = os.path.join(img_out_dir, image_name)

    try:
        with Image.open(img_path) as img:
            img_resized = img.resize(TARGET_SIZE, Image.Resampling.LANCZOS)
            img_resized.save(img_out_path)
    except FileNotFoundError:
        print(f"⚠️  Warning: image {image_name} not found. Skipping.")
        positives_count -= 1
        continue

    # Process the mask (polygons)
    mask_file = None
    for f in os.listdir(target_class_path):
        if os.path.splitext(f)[0] == base_name:
            mask_file = f
            break

    if not mask_file:
        continue

    lbl_out_dir = YOLO_LABELS_TRAIN if split == "train" else YOLO_LABELS_TEST
    label_lines = []

    mask_path = os.path.join(target_class_path, mask_file)
    with Image.open(mask_path) as mask:
        mask_gray = mask.convert("L")
        # NEAREST avoids introducing intermediate gray values at the mask edges
        mask_resized = mask_gray.resize(TARGET_SIZE, Image.Resampling.NEAREST)
        mask_array = np.array(mask_resized)

        # Any non-zero pixel counts as corrosion
        _, binary_mask = cv2.threshold(mask_array, 0, 255, cv2.THRESH_BINARY)
        # External contours only: holes inside a region are not encoded
        contours, _ = cv2.findContours(binary_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

        if contours:
            for contour in contours:
                # A polygon needs at least 3 points
                if contour.shape[0] < 3:
                    continue

                # Normalize pixel coordinates to [0, 1]
                normalized_contour = contour.squeeze(axis=1).astype(float)
                normalized_contour[:, 0] /= TARGET_SIZE[0]
                normalized_contour[:, 1] /= TARGET_SIZE[1]
                segment = normalized_contour.ravel().tolist()
                label_lines.append(f"0 {' '.join(f'{p:.6f}' for p in segment)}")

    if label_lines:
        label_path = os.path.join(lbl_out_dir, f"{base_name}.txt")
        with open(label_path, "w") as f:
            f.write("\n".join(label_lines))

⚙️  Processing the 209 positive images found...


In [5]:
# =============================
# PROCESS NEGATIVE IMAGES
# =============================
print("⚙️  Processing negative images...")

target_negatives = int(positives_count * NEGATIVES_RATIO)
print(f"  - Target: add {target_negatives} negative images ({NEGATIVES_RATIO:.0%} of the {positives_count} positives).")

shuffled_negative_basenames = list(true_negative_info.keys())
random.shuffle(shuffled_negative_basenames)

selected_negatives = shuffled_negative_basenames[:target_negatives]

for base_name in selected_negatives:
    negatives_count += 1
    # Negatives get a random 80/20 split; the CSV split is not consulted here
    split = "train" if random.random() < 0.8 else "test"

    image_name = true_negative_info[base_name]
    img_path = os.path.join(IMAGES_DIR, image_name)
    img_out_dir = YOLO_IMAGES_TRAIN if split == "train" else YOLO_IMAGES_TEST
    img_out_path = os.path.join(img_out_dir, image_name)

    # Only the image is written; negatives get no label file
    try:
        with Image.open(img_path) as img:
            img_resized = img.resize(TARGET_SIZE, Image.Resampling.LANCZOS)
            img_resized.save(img_out_path)
    except FileNotFoundError:
        print(f"⚠️  Warning: negative image {image_name} not found. Skipping.")
        negatives_count -= 1
        continue

print(f"✅ YOLO dataset created successfully in: {OUTPUT_ROOT}")
print(f"   - Images with corrosion (positives): {positives_count}")
print(f"   - Images without corrosion (negatives): {negatives_count}")

⚙️  Processing negative images...
  - Target: add 52 negative images (25% of the 209 positives).
✅ YOLO dataset created successfully in: ./dataset_yolo
   - Images with corrosion (positives): 209
   - Images without corrosion (negatives): 52
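
The negative selection in this cell draws from the global `random` state without a seed, so each run can pick different images and assign them to different splits. Below is a sketch of a reproducible variant, assuming a fixed seed is acceptable for the project; `select_negatives`, `seed`, and `train_fraction` are hypothetical names, not part of the original scripts.

```python
import random

def select_negatives(basenames, positives_count, ratio=0.25, train_fraction=0.8, seed=42):
    """Pick int(positives_count * ratio) negatives and assign each a split, reproducibly."""
    rng = random.Random(seed)      # private generator, independent of global state
    target = int(positives_count * ratio)
    shuffled = sorted(basenames)   # sort first so the result does not depend on listing order
    rng.shuffle(shuffled)
    selected = shuffled[:target]
    return {name: ("train" if rng.random() < train_fraction else "test") for name in selected}

# 1684 placeholder names, matching the negative count reported earlier in the notebook
candidates = [f"neg_{i:04d}" for i in range(1684)]
first = select_negatives(candidates, positives_count=209)
second = select_negatives(candidates, positives_count=209)
print(len(first), first == second)
# 52 True
```

The count follows from `int(209 * 0.25) = int(52.25) = 52`, which matches the target printed by the notebook.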


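## 3. Converting Masks to Bounding Boxes (Logic from `convert_masks.py`)

This section is optional. Run it if you need the labels in bounding-box (BBox) format for object detection. The conversion is done in place: it overwrites the segmentation labels in `labels/train` and `labels/test`, and the `labels_bbox` paths defined in section 1 are not used.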
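
To make the arithmetic concrete, here is a small worked example of turning one segmentation line into a detection line: the box is the min/max extent of the polygon, expressed as center, width, and height. The helper name `polygon_line_to_bbox_line` and the sample coordinates are illustrative only.

```python
def polygon_line_to_bbox_line(line):
    """Convert 'class x1 y1 ... xn yn' into 'class x_center y_center width height'."""
    parts = line.split()
    class_id = parts[0]
    values = [float(p) for p in parts[1:]]
    xs, ys = values[0::2], values[1::2]   # de-interleave x and y coordinates
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    x_center = (x_min + x_max) / 2
    y_center = (y_min + y_max) / 2
    width = x_max - x_min
    height = y_max - y_min
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# A rectangle spanning x in [0.10, 0.50] and y in [0.20, 0.60]
seg = "0 0.10 0.20 0.50 0.20 0.50 0.60 0.10 0.60"
bbox = polygon_line_to_bbox_line(seg)
print(bbox)
# 0 0.300000 0.400000 0.400000 0.400000
```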

In [6]:
def convert_masks_to_bboxes_inplace(labels_dir):
    if not os.path.exists(labels_dir):
        print(f"Warning: the labels directory does not exist: {labels_dir}")
        return

    print(f"🔄 Converting masks to bounding boxes IN-PLACE in {labels_dir}...")
    count = 0
    for filename in os.listdir(labels_dir):
        if filename.endswith(".txt"):
            filepath = os.path.join(labels_dir, filename)

            try:
                # Read the original contents (segmentation)
                with open(filepath, 'r') as f:
                    lines = f.readlines()

                new_lines = []
                for line in lines:
                    parts = line.strip().split()
                    # A polygon line is a class id plus at least 3 (x, y) pairs: an odd
                    # field count of 7 or more. Skipping everything else also leaves
                    # 5-field bbox lines alone if this function is run a second time.
                    if len(parts) < 7 or len(parts) % 2 == 0:
                        continue

                    class_id = parts[0]
                    # YOLO segmentation: class x1 y1 x2 y2 ... xn yn
                    coords = np.array([float(p) for p in parts[1:]]).reshape(-1, 2)

                    x_min = np.min(coords[:, 0])
                    y_min = np.min(coords[:, 1])
                    x_max = np.max(coords[:, 0])
                    y_max = np.max(coords[:, 1])

                    x_center = (x_min + x_max) / 2
                    y_center = (y_min + y_max) / 2
                    width = x_max - x_min
                    height = y_max - y_min

                    # YOLO detection: class x_center y_center width height
                    new_lines.append(f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}")

                # Overwrite the file with the new contents (BBox)
                if new_lines:
                    with open(filepath, 'w') as f:
                        f.write("\n".join(new_lines))
                    count += 1
            except Exception as e:
                print(f"❌ Error converting {filename}: {e}")

    print(f"  ✅ {count} files converted and updated to bounding boxes.")

In [7]:
# Run the IN-PLACE conversion for train and test
# This directly modifies the files in dataset_yolo/labels/...
convert_masks_to_bboxes_inplace(YOLO_LABELS_TRAIN)
convert_masks_to_bboxes_inplace(YOLO_LABELS_TEST)

print("\n🚀 Conversion complete: the label files now contain bounding boxes.")

🔄 Converting masks to bounding boxes IN-PLACE in ./dataset_yolo\labels\train...
  ✅ 188 files converted and updated to bounding boxes.
🔄 Converting masks to bounding boxes IN-PLACE in ./dataset_yolo\labels\test...
  ✅ 21 files converted and updated to bounding boxes.

🚀 Conversion complete: the label files now contain bounding boxes.
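
Because the in-place conversion discards the polygons, a variant that writes to a separate directory would keep both label sets. The sketch below is one way to do that, under the assumption that the `labels_bbox` paths from section 1 were meant for this purpose; `convert_labels_to_bbox_dir` is a hypothetical helper, and the demo runs on a temporary directory rather than the real dataset.

```python
import os
import tempfile

def convert_labels_to_bbox_dir(src_dir, dst_dir):
    """Read YOLO segmentation labels from src_dir and write bbox labels to dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    converted = 0
    for filename in sorted(os.listdir(src_dir)):
        if not filename.endswith(".txt"):
            continue
        bbox_lines = []
        with open(os.path.join(src_dir, filename)) as f:
            for line in f:
                parts = line.split()
                # Keep only polygon lines: class id plus at least 3 (x, y) pairs
                if len(parts) < 7 or len(parts) % 2 == 0:
                    continue
                xs = [float(p) for p in parts[1::2]]
                ys = [float(p) for p in parts[2::2]]
                x_center = (min(xs) + max(xs)) / 2
                y_center = (min(ys) + max(ys)) / 2
                width = max(xs) - min(xs)
                height = max(ys) - min(ys)
                bbox_lines.append(
                    f"{parts[0]} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"
                )
        if bbox_lines:
            with open(os.path.join(dst_dir, filename), "w") as f:
                f.write("\n".join(bbox_lines))
            converted += 1
    return converted

# Demo on a throwaway directory with one made-up label file
segmentation_text = "0 0.10 0.20 0.50 0.20 0.50 0.60 0.10 0.60"
with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "labels", "train")
    dst = os.path.join(root, "labels_bbox", "train")
    os.makedirs(src)
    with open(os.path.join(src, "sample.txt"), "w") as f:
        f.write(segmentation_text)
    converted = convert_labels_to_bbox_dir(src, dst)
    with open(os.path.join(src, "sample.txt")) as f:
        source_after = f.read()   # the segmentation label is left untouched
    with open(os.path.join(dst, "sample.txt")) as f:
        bbox_text = f.read()

print(converted, bbox_text)
```

With the notebook's configuration, the equivalent call would be `convert_labels_to_bbox_dir(YOLO_LABELS_TRAIN, YOLO_LABELS_BBOX_TRAIN)`, and likewise for the test paths.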
