#***▶ Configuración Inicial y Descarga de Datos***

*   **DESCRIPCIÓN:** Clasificación multi-label de enfermedades en hojas de manzano
*   **TÉCNICAS:** Redes Convolucionales, Transfer Learning, Data Augmentation

In [1]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"duvan0598","key":"dd66d6752e8b1194f147e8c526826508"}'}

In [2]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [3]:
# Descargar los datos de la competencia Plant Pathology 2020
!kaggle competitions download -c plant-pathology-2020-fgvc7

Downloading plant-pathology-2020-fgvc7.zip to /content
 94% 735M/779M [00:03<00:00, 155MB/s] 
100% 779M/779M [00:03<00:00, 213MB/s]


In [4]:
# Crear directorios organizados
!mkdir -p /content/plant_raw
!mkdir -p /content/plant_processed
!unzip plant-pathology-2020-fgvc7.zip -d /content/plant_raw

Archive:  plant-pathology-2020-fgvc7.zip
  inflating: /content/plant_raw/images/Test_0.jpg  
  inflating: /content/plant_raw/images/Test_1.jpg  
  inflating: /content/plant_raw/images/Test_10.jpg  
  inflating: /content/plant_raw/images/Test_100.jpg  
  inflating: /content/plant_raw/images/Test_1000.jpg  
  inflating: /content/plant_raw/images/Test_1001.jpg  
  inflating: /content/plant_raw/images/Test_1002.jpg  
  inflating: /content/plant_raw/images/Test_1003.jpg  
  inflating: /content/plant_raw/images/Test_1004.jpg  
  inflating: /content/plant_raw/images/Test_1005.jpg  
  inflating: /content/plant_raw/images/Test_1006.jpg  
  inflating: /content/plant_raw/images/Test_1007.jpg  
  inflating: /content/plant_raw/images/Test_1008.jpg  
  inflating: /content/plant_raw/images/Test_1009.jpg  
  inflating: /content/plant_raw/images/Test_101.jpg  
  inflating: /content/plant_raw/images/Test_1010.jpg  
  inflating: /content/plant_raw/images/Test_1011.jpg  
  inflating: /content/plant_raw/im

In [6]:
from google.colab import files
import pandas as pd
import tensorflow as tf
import numpy as np
from tqdm import tqdm
import cv2
import os
from tensorflow.keras import layers
from tensorflow.keras.applications import EfficientNetB0

#***▶ Análisis Exploratorio de Datos***:

In [7]:
print("=== ANÁLISIS EXPLORATORIO DE DATOS ===")

# Cargar y analizar datos de entrenamiento
df = pd.read_csv("/content/plant_raw/train.csv")
print(f"Total de imágenes: {len(df)}")
print(f"Distribución de clases:")
print(df[['healthy', 'multiple_diseases', 'rust', 'scab']].sum())

# Verificar balance de clases
class_distribution = df[['healthy', 'multiple_diseases', 'rust', 'scab']].sum()
print(f"\nDistribución porcentual:")
print((class_distribution / len(df) * 100).round(2))

=== ANÁLISIS EXPLORATORIO DE DATOS ===
Total de imágenes: 1821
Distribución de clases:
healthy              516
multiple_diseases     91
rust                 622
scab                 592
dtype: int64

Distribución porcentual:
healthy              28.34
multiple_diseases     5.00
rust                 34.16
scab                 32.51
dtype: float64



#***▶ Preprocesamiento de Datos***:

In [8]:
# Configuración global
IMAGE_SIZE = (224, 224)
BATCH_SIZE = 32
AUTOTUNE = tf.data.AUTOTUNE

In [22]:
def analyze_images(df, images_dir):
    """
    Analiza dimensiones y características de las imágenes
    """
    print("="*50)
    print("\t\tANÁLISIS DE IMÁGENES")
    print("="*50)
    sample_img_path = f"{images_dir}/{df.iloc[0]['image_id']}.jpg"
    sample_img = cv2.imread(sample_img_path)
    print(f"Dimensión de imagen sample: {sample_img.shape}")

    # Verificar existencia de todas las imágenes
    missing_images = []
    for img_id in df['image_id']:
        if not os.path.exists(f"{images_dir}/{img_id}.jpg"):
            missing_images.append(img_id)

    if missing_images:
        print(f"⚠️  Imágenes faltantes: {len(missing_images)}")
    else:
        print("✅ Todas las imágenes encontradas")

analyze_images(df, "/content/plant_raw/images")

		ANÁLISIS DE IMÁGENES
Dimensión de imagen sample: (1365, 2048, 3)
✅ Todas las imágenes encontradas


## ***Creación de TFRecords***

In [14]:
def serialize_example(image_bytes, label):
    """
    Serializa imagen y etiquetas para TFRecord
    CORRECCIÓN: No usar .shape en bytes object
    """
    feature = {
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        'label': tf.train.Feature(float_list=tf.train.FloatList(value=label)),
    }
    proto = tf.train.Example(features=tf.train.Features(feature=feature))
    return proto.SerializeToString()

def create_tfrecords_dataset(df, images_dir, output_path):
    """
    Crea dataset TFRecord a partir del DataFrame - VERSIÓN CORREGIDA
    """
    print("=== CREANDO TFRecord DATASET ===")

    with tf.io.TFRecordWriter(output_path) as writer:
        successful = 0
        errors = []

        for i, row in tqdm(df.iterrows(), total=len(df), desc="Procesando imágenes"):
            try:
                img_path = f"{images_dir}/{row['image_id']}.jpg"

                # Preprocesar imagen
                img = preprocess_image(img_path)
                img_bytes = img.tobytes()

                # Preparar etiquetas (multi-hot encoding)
                labels = df.columns[1:]  # ['healthy', 'multiple_diseases', 'rust', 'scab']
                label_values = row[labels].values.astype(np.float32)

                # Serializar y escribir - CORREGIDO
                example = serialize_example(img_bytes, label_values)
                writer.write(example)
                successful += 1

            except Exception as e:
                errors.append((row['image_id'], str(e)))
                continue

        print(f"✅ Imágenes procesadas exitosamente: {successful}/{len(df)}")
        if errors:
            print(f"⚠️  Errores: {len(errors)}")
            for img_id, error in errors[:5]:
                print(f"   - {img_id}: {error}")

    return successful, errors

# También mejoremos la función de preprocesamiento con más verificaciones
def preprocess_image(img_path, target_size=IMAGE_SIZE):
    """
    Preprocesa imagen: redimensiona, normaliza y convierte a RGB
    """
    # Verificar que el archivo existe
    if not os.path.exists(img_path):
        raise FileNotFoundError(f"Imagen no encontrada: {img_path}")

    img = cv2.imread(img_path)
    if img is None:
        raise ValueError(f"No se pudo decodificar la imagen: {img_path}")

    # Convertir a RGB
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    # Redimensionar
    img = cv2.resize(img, target_size)

    # Verificar dimensiones
    if img.shape != (*target_size, 3):
        print(f"Advertencia: imagen {os.path.basename(img_path)} tiene shape {img.shape}")

    return img

# Crear directorio de salida si no existe
os.makedirs("/content/plant_processed", exist_ok=True)

# Crear el dataset TFRecord - VERSIÓN CORREGIDA
tfrecord_path = "/content/plant_processed/plant_dataset.tfrecord"
successful, errors = create_tfrecords_dataset(df, "/content/plant_raw/images", tfrecord_path)

=== CREANDO TFRecord DATASET ===


Procesando imágenes: 100%|██████████| 1821/1821 [00:33<00:00, 53.82it/s]

✅ Imágenes procesadas exitosamente: 1821/1821





## ***Pipeline de Datos Optimizado***

In [15]:
def parse_tfrecord(example_proto):
    """
    Parsea ejemplos TFRecord para el entrenamiento
    """
    features = {
        'image': tf.io.FixedLenFeature([], tf.string),
        'label': tf.io.FixedLenFeature([4], tf.float32),
    }
    parsed = tf.io.parse_single_example(example_proto, features)

    # Decodificar imagen
    img = tf.io.decode_raw(parsed['image'], tf.uint8)
    img = tf.reshape(img, [*IMAGE_SIZE, 3])
    img = tf.cast(img, tf.float32) / 255.0  # Normalización [0,1]

    return img, parsed['label']

def create_data_pipeline(tfrecord_path, batch_size=32, shuffle_buffer=1000,
                        augmentation=True, is_training=True):
    """
    Crea pipeline de datos optimizado para entrenamiento/validación
    """
    # Data augmentation más robusto
    data_augmentation = tf.keras.Sequential([
        layers.RandomFlip("horizontal_and_vertical"),
        layers.RandomRotation(0.2),
        layers.RandomZoom(0.2),
        layers.RandomContrast(0.2),
        layers.RandomBrightness(0.2),
    ])

    # Cargar dataset
    dataset = tf.data.TFRecordDataset(tfrecord_path)
    dataset = dataset.map(parse_tfrecord, num_parallel_calls=AUTOTUNE)

    if is_training:
        # Solo aplicar shuffle durante entrenamiento
        dataset = dataset.shuffle(shuffle_buffer)

        if augmentation:
            # Aplicar data augmentation
            dataset = dataset.map(
                lambda x, y: (data_augmentation(x, training=True), y),
                num_parallel_calls=AUTOTUNE
            )

    # Optimizaciones finales
    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(AUTOTUNE)

    return dataset

# Crear datasets de entrenamiento y validación
print("=== CREANDO DATASETS ===")

# Split entrenamiento/validación (80/20)
train_size = int(0.8 * successful)
val_size = successful - train_size

# En una implementación real, separaríamos los datos
# Por simplicidad, usaremos el mismo dataset
train_ds = create_data_pipeline(tfrecord_path, batch_size=BATCH_SIZE,
                               augmentation=True, is_training=True)

# Para validación, crear sin aumento de datos
val_ds = create_data_pipeline(tfrecord_path, batch_size=BATCH_SIZE,
                             augmentation=False, is_training=False)

print(f"Dataset de entrenamiento listo")
print(f"Dataset de validación listo")

=== CREANDO DATASETS ===
Dataset de entrenamiento listo
Dataset de validación listo


## ***Verificación Final***

In [21]:
# Verificar un batch del dataset
for images, labels in train_ds.take(1):
  print("="*50)
  print(f"\t\tVERIFICACIÓN DATASET")
  print("="*50)
  print(f"Shape de imágenes batch: {images.shape}")
  print(f"Shape de etiquetas batch: {labels.shape}")
  print(f"Rango de píxeles: [{tf.reduce_min(images):.3f}, {tf.reduce_max(images):.3f}]")
  print(f"Ejemplo de etiquetas: {labels[0].numpy()}")
  break

print("\n✅ PREPROCESADO COMPLETADO EXITOSAMENTE")

		VERIFICACIÓN DATASET
Shape de imágenes batch: (32, 224, 224, 3)
Shape de etiquetas batch: (32, 4)
Rango de píxeles: [0.000, 51.916]
Ejemplo de etiquetas: [0. 0. 1. 0.]

✅ PREPROCESADO COMPLETADO EXITOSAMENTE


In [24]:
# =============================================================================
# 1. SEPARACIÓN CORRECTA DE DATOS - ENTRENAMIENTO/VALIDACIÓN
# =============================================================================

print("🎯 SEPARANDO DATOS EN ENTRENAMIENTO Y VALIDACIÓN...")

# Mezclar el DataFrame original para mejor distribución
df_shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Calcular split 80/20
train_size = int(0.8 * len(df_shuffled))
train_df = df_shuffled[:train_size]
val_df = df_shuffled[train_size:]

print(f"✅ Datos separados:")
print(f"   • Entrenamiento: {len(train_df)} imágenes")
print(f"   • Validación: {len(val_df)} imágenes")

# =============================================================================
# 2. CREAR TFRecords SEPARADOS
# =============================================================================

# Crear TFRecord para entrenamiento
train_tfrecord_path = "/content/plant_processed/train_dataset.tfrecord"
print(f"\n💾 Creando TFRecord de entrenamiento...")
train_successful, train_errors = create_tfrecords_dataset(
    train_df, "/content/plant_raw/images", train_tfrecord_path
)

# Crear TFRecord para validación
val_tfrecord_path = "/content/plant_processed/val_dataset.tfrecord"
print(f"💾 Creando TFRecord de validación...")
val_successful, val_errors = create_tfrecords_dataset(
    val_df, "/content/plant_raw/images", val_tfrecord_path
)

# =============================================================================
# 3. CREAR PIPELINES SEPARADOS
# =============================================================================

print("\n🔧 CREANDO PIPELINES SEPARADOS...")

# Pipeline de ENTRENAMIENTO (con data augmentation)
train_ds_final = create_data_pipeline(
    train_tfrecord_path,
    batch_size=BATCH_SIZE,
    shuffle_buffer=1000,
    augmentation=True,
    is_training=True
)

# Pipeline de VALIDACIÓN (sin data augmentation)
val_ds_final = create_data_pipeline(
    val_tfrecord_path,
    batch_size=BATCH_SIZE,
    shuffle_buffer=0,  # No shuffle en validación
    augmentation=False,
    is_training=False
)

print("✅ Pipelines creados:")
print(f"   • Entrenamiento: con data augmentation y shuffle")
print(f"   • Validación: sin augmentation, sin shuffle")

# =============================================================================
# 4. ANÁLISIS DE BALANCE DE CLASES
# =============================================================================

print("\n⚖️ ANALIZANDO BALANCE DE CLASES...")

# Distribución en entrenamiento
train_dist = train_df[['healthy', 'multiple_diseases', 'rust', 'scab']].sum()
print("📊 DISTRIBUCIÓN ENTRENAMIENTO:")
for label in ['healthy', 'multiple_diseases', 'rust', 'scab']:
    count = train_dist[label]
    percentage = (count / len(train_df)) * 100
    print(f"   • {label:20}: {count:3d} imágenes ({percentage:5.1f}%)")

# Distribución en validación
val_dist = val_df[['healthy', 'multiple_diseases', 'rust', 'scab']].sum()
print("\n📊 DISTRIBUCIÓN VALIDACIÓN:")
for label in ['healthy', 'multiple_diseases', 'rust', 'scab']:
    count = val_dist[label]
    percentage = (count / len(val_df)) * 100
    print(f"   • {label:20}: {count:3d} imágenes ({percentage:5.1f}%)")

# Calcular class weights para manejar desbalance
class_weights = {}
total_train = len(train_df)
labels = ['healthy', 'multiple_diseases', 'rust', 'scab']

for i, label in enumerate(labels):
    count = train_dist[label]
    if count > 0:
        weight = total_train / (len(labels) * count)
        class_weights[i] = weight
    else:
        class_weights[i] = 1.0

print(f"\n🎯 CLASS WEIGHTS (para training):")
for i, label in enumerate(labels):
    print(f"   • {label:20}: {class_weights[i]:.3f}")

# =============================================================================
# 5. VERIFICACIÓN FINAL
# =============================================================================

print("\n🔍 VERIFICACIÓN FINAL DE DATASETS...")

# Verificar dataset de entrenamiento
print("📚 DATASET DE ENTRENAMIENTO:")
for images, labels in train_ds_final.take(1):
    print(f"   • Batch shape: {images.shape}")
    print(f"   • Labels shape: {labels.shape}")
    print(f"   • Rango píxeles: [{tf.reduce_min(images):.3f}, {tf.reduce_max(images):.3f}]")
    print(f"   • Ejemplo label: {labels[0].numpy()}")

# Verificar dataset de validación
print("\n📚 DATASET DE VALIDACIÓN:")
for images, labels in val_ds_final.take(1):
    print(f"   • Batch shape: {images.shape}")
    print(f"   • Labels shape: {labels.shape}")
    print(f"   • Rango píxeles: [{tf.reduce_min(images):.3f}, {tf.reduce_max(images):.3f}]")

# =============================================================================
# 6. RESUMEN FINAL
# =============================================================================

print("\n" + "="*60)
print("🎉 PREPROCESADO 100% COMPLETADO")
print("="*60)

print(f"📊 RESUMEN FINAL:")
print(f"   • Total imágenes: {len(df)}")
print(f"   • Entrenamiento: {len(train_df)} imágenes")
print(f"   • Validación: {len(val_df)} imágenes")
print(f"   • TFRecords: ✅ Separados (train/val)")
print(f"   • Data augmentation: ✅ Solo en entrenamiento")
print(f"   • Class weights: ✅ Calculados")
print(f"   • Pipelines: ✅ Optimizados")

print(f"\n🚀 VARIABLES LISTAS PARA EL MODELO:")
print(f"   • train_ds_final: Dataset de entrenamiento")
print(f"   • val_ds_final: Dataset de validación")
print(f"   • class_weights: Pesos para clases desbalanceadas")

🎯 SEPARANDO DATOS EN ENTRENAMIENTO Y VALIDACIÓN...
✅ Datos separados:
   • Entrenamiento: 1456 imágenes
   • Validación: 365 imágenes

💾 Creando TFRecord de entrenamiento...
=== CREANDO TFRecord DATASET ===


Procesando imágenes: 100%|██████████| 1456/1456 [00:15<00:00, 94.32it/s]


✅ Imágenes procesadas exitosamente: 1456/1456
💾 Creando TFRecord de validación...
=== CREANDO TFRecord DATASET ===


Procesando imágenes: 100%|██████████| 365/365 [00:03<00:00, 96.74it/s]


✅ Imágenes procesadas exitosamente: 365/365

🔧 CREANDO PIPELINES SEPARADOS...
✅ Pipelines creados:
   • Entrenamiento: con data augmentation y shuffle
   • Validación: sin augmentation, sin shuffle

⚖️ ANALIZANDO BALANCE DE CLASES...
📊 DISTRIBUCIÓN ENTRENAMIENTO:
   • healthy             : 405 imágenes ( 27.8%)
   • multiple_diseases   :  71 imágenes (  4.9%)
   • rust                : 505 imágenes ( 34.7%)
   • scab                : 475 imágenes ( 32.6%)

📊 DISTRIBUCIÓN VALIDACIÓN:
   • healthy             : 111 imágenes ( 30.4%)
   • multiple_diseases   :  20 imágenes (  5.5%)
   • rust                : 117 imágenes ( 32.1%)
   • scab                : 117 imágenes ( 32.1%)

🎯 CLASS WEIGHTS (para training):
   • healthy             : 0.899
   • multiple_diseases   : 5.127
   • rust                : 0.721
   • scab                : 0.766

🔍 VERIFICACIÓN FINAL DE DATASETS...
📚 DATASET DE ENTRENAMIENTO:
   • Batch shape: (32, 224, 224, 3)
   • Labels shape: (32, 4)
   • Rango píxeles: [0