# 🐄 Bovine Weight Estimation System - ML Setup

**Project**: Hacienda Gamelera - Bruno Brito Macedo  
**Owner**: Person 2 - ML Infrastructure Setup  
**Goal**: Prepare the datasets and pipeline for training 7 breed-specific models  
**Duration**: 5-6 days  

---

## 📋 Task Checklist
- [x] Day 1: Set up Google Colab Pro + dependencies
- [ ] Days 2-3: Download and organize the critical datasets
- [ ] Day 4: Exploratory data analysis (EDA)
- [ ] Days 5-6: Prepare the optimized data pipeline

## 🎯 Target Breeds (7 breeds)
1. **Brahman** - robust Bos indicus
2. **Nelore** - Bos indicus
3. **Angus** - Bos taurus, good beef quality
4. **Cebuinas** - zebu-type cattle, general Bos indicus
5. **Criollo** - locally adapted
6. **Pardo Suizo** - Brown Swiss, large Bos taurus
7. **Jersey** - dairy breed, smaller frame


In [None]:
# ============================================================
# 📁 CONFIGURE THE PROJECT PATH
# ============================================================

import sys
from pathlib import Path

# ✅ Set where the repository lives inside the Colab runtime.
#    - If you just cloned the repo:     BASE_DIR = Path('/content/bovine-weight-estimation')
#    - If you keep it in Google Drive:  BASE_DIR = Path('/content/drive/MyDrive/<folder>')
BASE_DIR = Path('/content/bovine-weight-estimation')
# BASE_DIR = Path('/content/drive/MyDrive/bovine-weight-estimation')  # <--- Uncomment if you use Drive

# Add the src folder to sys.path so that all internal modules are importable.
ML_TRAINING_DIR = BASE_DIR / 'ml-training'
sys.path.insert(0, str(ML_TRAINING_DIR / 'src'))

# Check that the project structure exists before continuing.
if ML_TRAINING_DIR.exists():
    print(f"✅ Project found at: {ML_TRAINING_DIR}")
    print("📂 Expected key subfolders:")
    print(f"   - Source code: {ML_TRAINING_DIR / 'src'}")
    print(f"   - Utility scripts: {ML_TRAINING_DIR / 'scripts'}")
    print(f"   - Configuration: {ML_TRAINING_DIR / 'config'}")
else:
    print(f"⚠️ Project not found at: {ML_TRAINING_DIR}")
    print("💡 Adjust BASE_DIR or check that you cloned/mounted the repository correctly.")
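
The cell above only checks that `ml-training` exists and then prints the subfolder paths without testing them. A minimal sketch of a stricter check follows. It assumes `src`, `scripts`, and `config` are all required; `find_missing_subdirs` is a hypothetical helper, not part of the project.

```python
from pathlib import Path
import tempfile

# Subfolders printed by the setup cell; treating them as required is an assumption.
EXPECTED_SUBDIRS = ("src", "scripts", "config")


def find_missing_subdirs(ml_training_dir: Path, expected=EXPECTED_SUBDIRS) -> list[str]:
    """Return the names of expected subfolders that do not exist."""
    return [name for name in expected if not (ml_training_dir / name).is_dir()]


# Demo on a throwaway directory so the snippet runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "ml-training"
    (root / "src").mkdir(parents=True)
    (root / "config").mkdir()
    print(find_missing_subdirs(root))  # ['scripts']
```

In the notebook, calling `find_missing_subdirs(ML_TRAINING_DIR)` right after the path setup would surface a partial clone before the project imports fail.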


In [None]:
# ============================================================
# ✅ IMPORT THE PROJECT MODULES
# ============================================================
# Note: these imports need TensorFlow and Albumentations, so run the
# dependency installation cell first on a fresh runtime.

# Data augmentation
from data.augmentation import get_training_transform, get_aggressive_augmentation, get_validation_transform

# Models
from models.cnn_architecture import BreedWeightEstimatorCNN, BREED_CONFIGS

# Evaluation
from models.evaluation.metrics import MetricsCalculator, ModelMetrics

# TFLite export
from models.export.tflite_converter import TFLiteExporter

print("✅ All modules imported successfully")
print("\n📦 Available modules:")
print("   - Data augmentation (Albumentations 2.0.8)")
print("   - CNN architectures (MobileNetV2, EfficientNet)")
print("   - Metrics calculator (R², MAE, MAPE)")
print("   - TFLite exporter (optimized for mobile)")


In [None]:
# ============================================================
# 🎓 EXAMPLE: CREATE A MODEL FOR ONE BREED
# ============================================================

# Example 1: create a model for Brahman
model_brahman = BreedWeightEstimatorCNN.build_model(
    breed_name='brahman',
    base_architecture='mobilenetv2'  # Faster than EfficientNet
)

print(f"✅ Model created: {model_brahman.name}")
print(f"📊 Parameters: {model_brahman.count_params():,}")

# Inspect the architecture
print("\n📐 Model architecture:")
model_brahman.summary()


---

## 📝 Next Steps

1. **Download the datasets** (CID, CattleEyeView, etc.)
2. **Preprocess the data** with the project modules
3. **Train a generic base model**
4. **Fine-tune per breed** (5 breeds)
5. **Collect our own data** (Criollo, Pardo Suizo)
6. **Export to TFLite** and integrate into the mobile app

> See `README.md` and `scripts/train_all_breeds.py` for more examples.



## 🚀 Day 1: Google Colab Pro Setup + Dependencies

In [None]:
# ============================================================
# 🔧 DEPENDENCY INSTALLATION - STABLE CONFIGURATION (Colab 2025)
# ============================================================

# ⚠️ Run this cell ONLY once after opening the notebook. The pinned versions
#    are compatible with Python 3.10 and with Colab's T4 GPU (TensorFlow 2.17).
#    If the dependencies are already installed, skip it to avoid reinstalling.
!pip install -q --upgrade pip
!pip install -q tensorflow==2.17.0 tensorflow-hub tensorflow-datasets
!pip install -q albumentations==2.0.8 opencv-python-headless==4.10.0.84
# The extras spec is quoted so the shell does not treat [gs,s3] as a glob pattern.
!pip install -q kaggle gdown mlflow==2.14.1 "dvc[gs,s3]==3.51.1" plotly seaborn
!pip install -q numpy==1.26.4 pillow==11.0.0 pyarrow==15.0.2 packaging==24.2 google-images-download==2.8.0 scikit-learn==1.3.2
!pip install -q protobuf==4.25.3

import tensorflow as tf
from tensorflow.keras import mixed_precision

print("✅ TensorFlow:", tf.__version__)

# Query the GPUs once and configure them before anything initializes the devices.
gpus = tf.config.list_physical_devices('GPU')
print("✅ GPUs detected:", gpus)

if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        print("🎮 GPU ready for training.")
    except RuntimeError as e:
        # Memory growth must be set before the GPU is initialized.
        print("⚠️ Error configuring GPU:", e)
else:
    print("⚠️ No GPU detected. Enable one under Runtime > Change runtime type.")

# Enable mixed precision (FP16 compute on GPU) to speed up training.
mixed_precision.set_global_policy('mixed_float16')
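
Because Colab preinstalls many packages, the pins in the install cell can be silently overridden. The sketch below compares installed versions against a few of those pins using the standard library's `importlib.metadata`. The `PINNED` subset and the `find_version_mismatches` helper are illustrative assumptions, not part of the project.

```python
from importlib import metadata

# A subset of the pins from the install cell; extend the mapping as needed.
PINNED = {
    "tensorflow": "2.17.0",
    "albumentations": "2.0.8",
    "numpy": "1.26.4",
    "protobuf": "4.25.3",
}


def find_version_mismatches(pins: dict[str, str], lookup=metadata.version) -> dict[str, str]:
    """Map each problematic package to 'installed != pinned' or 'not installed'."""
    problems = {}
    for package, wanted in pins.items():
        try:
            installed = lookup(package)
        except metadata.PackageNotFoundError:
            problems[package] = "not installed"
            continue
        if installed != wanted:
            problems[package] = f"{installed} != {wanted}"
    return problems


print(find_version_mismatches(PINNED) or "All pinned versions match.")
```

The `lookup` parameter exists only so the function can be exercised without installing anything; in the notebook the default is enough.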


In [None]:
# ============================================================
# ✅ COMPATIBILITY FIX - Albumentations 2.0.8 (Colab 2025)
# ============================================================

# ⚠️ Only needed if Colab automatically installed incompatible versions. This cell
#    reinstalls the Albumentations/OpenCV pair that is stable on Python 3.10.
!pip install -q --upgrade pip
!pip uninstall -y albumentations albucore
!pip install -q albumentations==2.0.8 opencv-python-headless==4.10.0.84

import albumentations as A
import cv2

print("✅ Albumentations installed correctly:", A.__version__)
print("✅ OpenCV:", cv2.__version__)




In [None]:
# ============================================================
# IMPORTS AND CONFIGURATION
# ============================================================

# 🔍 Full set of libraries used in the pipeline: system utilities,
#    data science, visualization, ML, and experiment tracking.
import os
import sys
import shutil
import subprocess
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from pathlib import Path
import json
import requests
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

from sklearn.metrics import r2_score

# Image processing and augmentation. Imported here as well so that this cell
# works even when the optional Albumentations fix cell was skipped.
import cv2
import albumentations as A

# TensorFlow/Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models, optimizers, callbacks
from tensorflow.keras.applications import EfficientNetB0, MobileNetV2
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# MLflow for reproducible experiment tracking.
import mlflow
import mlflow.tensorflow

# Configure matplotlib so that all plots look consistent in Colab.
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

print("✅ All dependencies imported successfully")
print(f"📊 Versions: TF={tf.__version__}, CV2={cv2.__version__}, Albumentations={A.__version__}")


In [None]:
# ============================================================
# ⚙️ PROJECT CONFIGURATION (bovine-weight-estimation)
# ============================================================

from pathlib import Path
import mlflow
from google.colab import drive

# 🔗 Mount Google Drive only if it is not available yet, so datasets and models persist.
if not Path('/content/drive').exists() or not any(Path('/content/drive').iterdir()):
    drive.mount('/content/drive')
else:
    print('ℹ️ Google Drive is already mounted.')

# 📁 Base directory inside your Drive where all training artifacts are stored.
#    Note: this rebinds BASE_DIR, which the first cell used for the repository path.
BASE_DIR = Path('/content/drive/MyDrive/bovine-weight-estimation')

# 📂 Create (if missing) the standard folders for raw data, processed data, and models.
DATA_DIR = BASE_DIR / 'data'
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
AUGMENTED_DIR = DATA_DIR / 'augmented'
MODELS_DIR = BASE_DIR / 'models'
MLRUNS_DIR = BASE_DIR / 'mlruns'

for dir_path in [DATA_DIR, RAW_DIR, PROCESSED_DIR, AUGMENTED_DIR, MODELS_DIR, MLRUNS_DIR]:
    dir_path.mkdir(parents=True, exist_ok=True)

# ------------------------------------------------------------
# 📊 MLflow configuration (persistent local tracking)
# ------------------------------------------------------------
mlflow.set_tracking_uri(f"file://{MLRUNS_DIR}")
mlflow.set_experiment("bovine-weight-estimation")

# ------------------------------------------------------------
# ⚙️ General training configuration (base hyperparameters)
# ------------------------------------------------------------
CONFIG = {
    'image_size': (224, 224),
    'batch_size': 32,
    'epochs': 100,
    'learning_rate': 0.001,
    'validation_split': 0.2,
    'test_split': 0.1,
    'early_stopping_patience': 10,
    'target_r2': 0.95,
    'max_mae': 5.0,
    'max_inference_time': 3.0
}

# ------------------------------------------------------------
# 🐄 Target breeds (Santa Cruz, Chiquitanía, and Pampa)
# ------------------------------------------------------------
# Kept in sync with the 7 target breeds listed at the top of the notebook.
BREEDS = [
    'brahman', 'nelore', 'angus', 'cebuinas',
    'criollo', 'pardo_suizo', 'jersey'
]

print("✅ Configuration completed successfully")
print(f"📁 Base directory: {BASE_DIR}")
print(f"🎯 Target breeds: {len(BREEDS)} breeds -> {BREEDS}")
print(f"📊 MLflow tracking: {MLRUNS_DIR}")
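
The `validation_split` and `test_split` entries in `CONFIG` implicitly define the training share as whatever is left. A small sanity check, sketched here under the assumption that both fractions must lie strictly between 0 and 1, can catch a typo before any data is loaded. `validate_split_config` is a hypothetical helper.

```python
def validate_split_config(config: dict) -> float:
    """Check the split fractions and return the implied training fraction."""
    val, test = config["validation_split"], config["test_split"]
    if not (0 < val < 1 and 0 < test < 1):
        raise ValueError("validation_split and test_split must be in (0, 1)")
    train = 1.0 - val - test
    if train <= 0:
        raise ValueError("validation_split + test_split must be below 1")
    return train


example = {"validation_split": 0.2, "test_split": 0.1}
print(f"train fraction: {validate_split_config(example):.2f}")  # train fraction: 0.70
```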


## 📥 Days 2-3: Download and Organize the Critical Datasets


In [None]:
# ============================================================
# 1. CID DATASET (17,899 images) - MOST IMPORTANT
# ============================================================

CID_DATASET_ARCHIVE_PATH = os.environ.get('CID_DATASET_ARCHIVE_PATH')


def download_cid_dataset(archive_path: str | None = CID_DATASET_ARCHIVE_PATH) -> Path:
    """Prepare the CID Dataset from a previously downloaded archive.

    Requirements before running:
    1. Upload the real compressed file (zip/tar) to the directory defined in BASE_DIR or to /content.
    2. Set the CID_DATASET_ARCHIVE_PATH environment variable to point to that file.

    No synthetic data is generated: if the file is missing, the function stops with an error.
    """
    cid_dir = RAW_DIR / 'cid'
    cid_dir.mkdir(parents=True, exist_ok=True)

    if any(cid_dir.iterdir()):
        print(f"ℹ️ CID Dataset is already available in {cid_dir}. Skipping extraction.")
        return cid_dir

    if archive_path is None:
        raise RuntimeError(
            "Set the CID_DATASET_ARCHIVE_PATH environment variable to the path of the "
            "CID Dataset archive (for example .zip or .tar.gz) before running this cell."
        )

    archive_path = Path(archive_path)
    if not archive_path.exists():
        raise FileNotFoundError(
            f"CID Dataset archive not found at {archive_path}. "
            "Upload the real dataset to your Google Drive and run again."
        )

    print(f"📥 Extracting CID Dataset from: {archive_path}")
    try:
        shutil.unpack_archive(str(archive_path), str(cid_dir))
    except shutil.ReadError as exc:
        raise RuntimeError(
            "Could not unpack the CID Dataset. Check that the file is in a supported format "
            "(.zip, .tar, .tar.gz, .tar.bz2, etc.)."
        ) from exc

    if not any(cid_dir.iterdir()):
        raise RuntimeError(
            "Extracting the CID Dataset produced no files. Check that the archive contains valid data."
        )

    print(f"✅ CID Dataset prepared in: {cid_dir}")
    return cid_dir


# Run the preparation step (requires the real archive to be uploaded beforehand)
cid_dataset_path = download_cid_dataset()
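
`shutil.unpack_archive` picks the extraction format from the file extension, so an unsupported extension only fails at extraction time. As a rough pre-check, the sketch below asks `shutil.get_unpack_formats()` which extensions are registered. `is_supported_archive` is a hypothetical helper and the file names are placeholders.

```python
import shutil


def is_supported_archive(filename: str) -> bool:
    """True if shutil.unpack_archive has a registered format for this extension."""
    name = filename.lower()
    for _format_name, extensions, _description in shutil.get_unpack_formats():
        if any(name.endswith(ext) for ext in extensions):
            return True
    return False


print(is_supported_archive("cid_dataset.zip"))     # True
print(is_supported_archive("cid_dataset.tar.gz"))  # True
print(is_supported_archive("cid_dataset.rar"))     # False
```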


In [None]:
# ============================================================
# 2. KAGGLE CATTLE WEIGHT DATASET (12k images)
# ============================================================

KAGGLE_DATASET_ID = os.environ.get(
    'KAGGLE_DATASET_ID', 'sadhliroomyprime/cattle-weight-detection-model-dataset-12k'
)


def setup_kaggle_api() -> Path:
    """Configure the Kaggle API for real downloads."""
    print("🔑 Configuring the Kaggle API...")

    kaggle_dir = Path('/root/.kaggle')
    kaggle_dir.mkdir(exist_ok=True)

    kaggle_json = kaggle_dir / 'kaggle.json'
    if not kaggle_json.exists():
        raise FileNotFoundError(
            "/root/.kaggle/kaggle.json was not found. Download your token from "
            "https://www.kaggle.com/account, upload it to the notebook, and run again."
        )

    # The Kaggle CLI warns if the token is readable by other users.
    kaggle_json.chmod(0o600)
    return kaggle_dir


def download_kaggle_dataset(dataset_id: str = KAGGLE_DATASET_ID) -> Path:
    """Download the given Kaggle dataset.

    Requirements:
    - Upload `kaggle.json` (API token) to this notebook and place it in /root/.kaggle/
    - Set KAGGLE_DATASET_ID if you want a dataset other than the default.
    """
    if not dataset_id:
        raise RuntimeError("Set the KAGGLE_DATASET_ID environment variable to the dataset to download.")

    setup_kaggle_api()
    output_dir = RAW_DIR / 'kaggle'
    output_dir.mkdir(parents=True, exist_ok=True)

    if any(output_dir.glob('**/*')):
        print(f"ℹ️ Kaggle dataset already present in {output_dir}. Skipping download.")
        return output_dir

    print(f"📥 Downloading Kaggle dataset: {dataset_id}")
    subprocess.run([
        "kaggle",
        "datasets",
        "download",
        "-d",
        dataset_id,
        "-p",
        str(output_dir),
    ], check=True)

    archive_files = list(output_dir.glob('*.zip'))
    if not archive_files:
        raise RuntimeError("The Kaggle download produced no .zip files. Check the dataset ID.")

    for archive_file in archive_files:
        print(f"📦 Unzipping {archive_file.name}")
        subprocess.run([
            "unzip",
            "-q",
            str(archive_file),
            "-d",
            str(output_dir),
        ], check=True)
        archive_file.unlink()

    if not any(output_dir.glob('**/*')):
        raise RuntimeError("Extracting the Kaggle dataset produced no files. Review the downloaded content.")

    print(f"✅ Kaggle dataset available in: {output_dir}")
    return output_dir


# Run the download (requires real credentials)
kaggle_dataset_path = download_kaggle_dataset()
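
The download cell expects `kaggle.json` to already be in `/root/.kaggle/`. One possible way to place an uploaded token there from Python is sketched below. `install_kaggle_token` is a hypothetical helper, and the demo uses dummy credentials in a temporary directory rather than a real token.

```python
import json
import tempfile
from pathlib import Path


def install_kaggle_token(source: Path, kaggle_dir: Path) -> Path:
    """Copy a kaggle.json token into kaggle_dir and restrict it to the owner."""
    credentials = json.loads(source.read_text())
    if not {"username", "key"} <= credentials.keys():
        raise ValueError("kaggle.json must contain 'username' and 'key'")
    kaggle_dir.mkdir(parents=True, exist_ok=True)
    target = kaggle_dir / "kaggle.json"
    target.write_text(json.dumps(credentials))
    target.chmod(0o600)  # owner read/write only
    return target


# Demo with dummy credentials (no real token involved).
with tempfile.TemporaryDirectory() as tmp:
    uploaded = Path(tmp) / "kaggle.json"
    uploaded.write_text(json.dumps({"username": "demo-user", "key": "demo-key"}))
    installed = install_kaggle_token(uploaded, Path(tmp) / ".kaggle")
    print(installed.name)  # kaggle.json
```

In Colab the call would look like `install_kaggle_token(Path('/content/kaggle.json'), Path('/root/.kaggle'))`, where `/content/kaggle.json` is an assumed upload location.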


In [None]:
# ============================================================
# 3. GOOGLE IMAGES SCRAPING FOR LOCAL BREEDS
# ============================================================

def scrape_google_images():
    """Scrape Google Images for local breeds.

    Optional step to complement under-represented breeds. Respect the search engine's
    terms of use and avoid running it repeatedly so you do not get blocked.
    """
    print("🖼️ Scraping Google Images for local breeds...")

    from google_images_download import google_images_download

    # Search terms for the local breeds
    breeds_local = [
        'ganado criollo boliviano',
        'guzerat bolivia',
        'brahman chiquitania',
        'nelore pantanal',
        'angus bolivia',
        'pardo suizo bolivia',
        'jersey bolivia'
    ]

    response = google_images_download.googleimagesdownload()

    scraped_count = 0

    for breed in breeds_local:
        try:
            print(f"📸 Scraping: {breed}")

            # Download configuration
            arguments = {
                "keywords": breed,
                "limit": 50,  # Limit per search term
                "print_urls": False,
                "output_directory": str(RAW_DIR / 'scraped'),
                "image_directory": breed.replace(' ', '_'),
                "format": "jpg",
                "size": "medium",
                "aspect_ratio": "wide"
            }

            # Download the images
            paths = response.download(arguments)

            if paths:
                # download() returns (paths_by_keyword, error_count); count the files,
                # not the number of keywords in the dictionary.
                paths_by_keyword = paths[0] if isinstance(paths, tuple) else paths
                count = sum(len(files) for files in paths_by_keyword.values())
                scraped_count += count
                print(f"✅ {breed}: {count} images downloaded")

        except Exception as e:
            print(f"⚠️ Error with {breed}: {e}")
            continue

    print(f"🎯 Total images scraped: {scraped_count}")
    return scraped_count

# Run the scraping step
scraped_images = scrape_google_images()
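
The count returned by the scraper depends on what the library reports, so it can be worth cross-checking against the files that actually landed on disk. The sketch below counts images per search-term folder. `count_images_per_folder` and the accepted suffixes are assumptions, and the demo builds a fake folder tree instead of touching `RAW_DIR`.

```python
from collections import Counter
from pathlib import Path
import tempfile

# File extensions treated as images; an assumption for this sketch.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_folder(scraped_dir: Path) -> Counter:
    """Count image files grouped by the name of their parent folder."""
    counts = Counter()
    for path in scraped_dir.rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            counts[path.parent.name] += 1
    return counts


# Demo on a temporary tree that mimics one folder per search term.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "angus_bolivia").mkdir()
    (root / "jersey_bolivia").mkdir()
    for name in ("angus_bolivia/1.jpg", "angus_bolivia/2.JPG", "jersey_bolivia/1.jpg"):
        (root / name).write_bytes(b"")
    (root / "angus_bolivia" / "notes.txt").write_text("not an image")
    print(dict(count_images_per_folder(root)))
```

In the notebook the equivalent call would be `count_images_per_folder(RAW_DIR / 'scraped')`.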


In [None]:
# ============================================================
# SUMMARY OF DOWNLOADED DATASETS
# ============================================================

def summarize_datasets(cid_df: pd.DataFrame | None = None) -> pd.DataFrame:
    """Summarize all available datasets (real data only)."""
    print("📊 DATASET SUMMARY")
    print("=" * 50)

    datasets_info = []

    if cid_df is not None:
        datasets_info.append({
            'name': 'CID Dataset',
            'images': len(cid_df),
            'description': 'Computer Vision Research - Cattle Image Database',
            'status': '✅ Available',
        })
    else:
        datasets_info.append({
            'name': 'CID Dataset',
            'images': 0,
            'description': 'CID metadata not loaded yet',
            'status': '⚠️ Pending',
        })

    if kaggle_dataset_path and kaggle_dataset_path.exists():
        kaggle_images = len(list(kaggle_dataset_path.glob('**/*.jpg')))
        datasets_info.append({
            'name': 'Kaggle Cattle Weight',
            'images': kaggle_images,
            'description': f'Kaggle dataset ({KAGGLE_DATASET_ID})',
            'status': '✅ Available' if kaggle_images > 0 else '⚠️ Empty',
        })
    else:
        datasets_info.append({
            'name': 'Kaggle Cattle Weight',
            'images': 0,
            'description': 'Requires Kaggle API configuration',
            'status': '⚠️ Pending',
        })

    datasets_info.append({
        'name': 'Google Images Scraped',
        'images': scraped_images,
        'description': 'Local Bolivian breeds',
        'status': '✅ Available' if scraped_images > 0 else '⚠️ Pending',
    })

    df_datasets = pd.DataFrame(datasets_info)
    print(df_datasets.to_string(index=False))

    total_images = int(df_datasets['images'].sum())
    print(f"\n🎯 TOTAL IMAGES AVAILABLE: {total_images:,}")

    summary_path = DATA_DIR / 'datasets_summary.csv'
    df_datasets.to_csv(summary_path, index=False)
    print(f"\n💾 Summary saved to: {summary_path}")

    return df_datasets

# Run the summary. df_cid is created later, in the EDA cell, so on a first
# top-to-bottom run it does not exist yet; fall back to None instead of raising NameError.
datasets_summary = summarize_datasets(globals().get('df_cid'))


## 📊 Day 4: Exploratory Data Analysis (EDA)


In [None]:
# ============================================================
# EXPLORATORY ANALYSIS - CID DATASET
# ============================================================

CID_METADATA_FILE = Path(os.environ.get('CID_METADATA_FILE', cid_dataset_path / 'metadata.csv'))


def analyze_cid_dataset(metadata_file: Path) -> pd.DataFrame:
    """Exploratory analysis using real data from the CID Dataset."""
    if not metadata_file.exists():
        raise FileNotFoundError(
            f"CID Dataset metadata file not found at {metadata_file}. "
            "Generate or place a CSV with the columns ['image_path', 'weight_kg', 'breed', 'age_category', 'image_quality', 'lighting', 'angle']."
        )

    df_cid = pd.read_csv(metadata_file)

    required_columns = {
        'image_path',
        'weight_kg',
        'breed',
        'age_category',
        'image_quality',
        'lighting',
        'angle',
    }
    missing_columns = required_columns.difference(df_cid.columns)
    if missing_columns:
        raise ValueError(
            f"The CID Dataset metadata is missing required columns: {sorted(missing_columns)}"
        )

    print("📊 EXPLORATORY ANALYSIS - CID DATASET")
    print("=" * 60)
    print(f"📈 Total images: {len(df_cid):,}")
    print(f"📊 Shape: {df_cid.shape}")

    print("\n📋 Available columns:")
    for col in df_cid.columns:
        print(f"  - {col}")

    print("\n⚖️ WEIGHT DISTRIBUTION:")
    print(df_cid['weight_kg'].describe())

    print("\n🐄 DISTRIBUTION BY BREED:")
    print(df_cid['breed'].value_counts())

    print("\n📸 IMAGE QUALITY:")
    print(df_cid['image_quality'].value_counts())

    return df_cid


# Run the analysis (requires real metadata)
df_cid = analyze_cid_dataset(CID_METADATA_FILE)
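
The analysis cell expects a `metadata.csv` with seven specific columns. To make that layout concrete, here is a minimal sketch that checks a CSV header with the standard `csv` module. The sample row is an invented placeholder (the real category values are not specified in this notebook), and `missing_metadata_columns` is a hypothetical helper.

```python
import csv
import io

REQUIRED_COLUMNS = {
    "image_path", "weight_kg", "breed", "age_category",
    "image_quality", "lighting", "angle",
}

# Placeholder content showing the expected layout; the values are made up.
SAMPLE_METADATA = """image_path,weight_kg,breed,age_category,image_quality,lighting,angle
images/0001.jpg,412.5,brahman,adult,high,daylight,side
"""


def missing_metadata_columns(csv_text: str) -> set[str]:
    """Return the required columns that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return REQUIRED_COLUMNS - set(reader.fieldnames or [])


print(missing_metadata_columns(SAMPLE_METADATA))  # set()
```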


In [None]:
# ============================================================
# EDA VISUALIZATIONS
# ============================================================

def create_eda_visualizations(df):
    """Create the full set of EDA visualizations."""
    print("📊 Creating EDA visualizations...")

    # Configure the subplots. The pie chart needs a 'domain' cell; the default
    # 'xy' cell type rejects go.Pie traces.
    fig = make_subplots(
        rows=3, cols=2,
        subplot_titles=(
            'Weight Distribution', 'Weight by Breed',
            'Distribution by Age', 'Image Quality',
            'Weight vs Lighting', 'Weight vs Angle'
        ),
        specs=[[{"type": "xy"}, {"type": "xy"}],
               [{"type": "xy"}, {"type": "domain"}],
               [{"type": "xy"}, {"type": "xy"}]]
    )

    # 1. Weight distribution
    fig.add_trace(
        go.Histogram(x=df['weight_kg'], nbinsx=50, name='Weight (kg)',
                    marker_color='lightblue', opacity=0.7),
        row=1, col=1
    )

    # 2. Weight by breed
    for breed in df['breed'].unique():
        breed_data = df[df['breed'] == breed]['weight_kg']
        fig.add_trace(
            go.Box(y=breed_data, name=breed, boxpoints='outliers'),
            row=1, col=2
        )

    # 3. Distribution by age
    age_counts = df['age_category'].value_counts()
    fig.add_trace(
        go.Bar(x=age_counts.index, y=age_counts.values, name='Age Categories',
               marker_color='lightgreen'),
        row=2, col=1
    )

    # 4. Image quality
    quality_counts = df['image_quality'].value_counts()
    fig.add_trace(
        go.Pie(labels=quality_counts.index, values=quality_counts.values,
               name='Quality'),
        row=2, col=2
    )

    # 5. Weight vs lighting
    for lighting in df['lighting'].unique():
        lighting_data = df[df['lighting'] == lighting]['weight_kg']
        fig.add_trace(
            go.Box(y=lighting_data, name=lighting),
            row=3, col=1
        )

    # 6. Weight vs angle
    for angle in df['angle'].unique():
        angle_data = df[df['angle'] == angle]['weight_kg']
        fig.add_trace(
            go.Box(y=angle_data, name=angle),
            row=3, col=2
        )

    # Configure the layout
    fig.update_layout(
        height=1200,
        title_text="Exploratory Analysis - CID Dataset",
        title_x=0.5,
        showlegend=True
    )

    # Show the figure
    fig.show()

    # Save the figure
    html_path = DATA_DIR / 'eda_visualizations.html'
    fig.write_html(str(html_path))
    print(f"💾 Visualizations saved to: {html_path}")

    return fig

# Run the visualizations
eda_fig = create_eda_visualizations(df_cid)


In [None]:
# ============================================================
# PER-BREED ANALYSIS
# ============================================================

def analyze_breeds_for_training(df):
    """Analyze which breeds are well enough represented for training."""
    print("🐄 PER-BREED ANALYSIS FOR TRAINING")
    print("=" * 50)

    # Project target breeds
    target_breeds = ['brahman', 'nelore', 'angus', 'cebuinas', 'criollo', 'pardo_suizo', 'jersey']

    breed_analysis = []

    for breed in target_breeds:
        if breed in df['breed'].values:
            # The breed is labeled directly in the dataset.
            breed_data = df[df['breed'] == breed]
            count = len(breed_data)
            avg_weight = breed_data['weight_kg'].mean()
            std_weight = breed_data['weight_kg'].std()

            if count >= 1000:
                status, strategy = "✅ Sufficient", 'Direct training'
            elif count >= 100:
                status, strategy = "⚠️ Limited", 'Transfer learning'
            else:
                status, strategy = "❌ Insufficient", 'Data collection'
        else:
            # Fall back to similar (proxy) labels in the dataset.
            similar_breeds = []
            if breed in ['brahman', 'nelore', 'cebuinas']:
                similar_breeds = ['mixed']  # Bos indicus
            elif breed in ['angus']:
                similar_breeds = ['mixed']  # Bos taurus

            proxy_data = df[df['breed'].isin(similar_breeds)]
            count = len(proxy_data)
            avg_weight = proxy_data['weight_kg'].mean() if count else 0
            std_weight = proxy_data['weight_kg'].std() if count else 0

            # Proxy images support transfer learning at best, never direct training,
            # so the strategy must agree with the status shown for these breeds.
            if count >= 1000:
                status, strategy = "🔄 Transfer Learning", 'Transfer learning'
            else:
                status, strategy = "❌ Collection required", 'Data collection'

        breed_analysis.append({
            'breed': breed,
            'images_available': count,
            'avg_weight_kg': round(avg_weight, 1),
            'std_weight_kg': round(std_weight, 1),
            'status': status,
            'strategy': strategy
        })

    # Build the DataFrame
    df_breed_analysis = pd.DataFrame(breed_analysis)

    # Show the table
    print(df_breed_analysis.to_string(index=False))

    # Save the analysis
    df_breed_analysis.to_csv(DATA_DIR / 'breed_analysis.csv', index=False)
    print(f"\n💾 Per-breed analysis saved to: {DATA_DIR / 'breed_analysis.csv'}")

    # Recommendations, grouped by the strategy assigned above
    print("\n🎯 RECOMMENDATIONS:")

    labels = {
        'Direct training': "✅ Direct training",
        'Transfer learning': "🔄 Transfer learning",
        'Data collection': "📸 Collection required",
    }
    for strategy, label in labels.items():
        breeds = df_breed_analysis[df_breed_analysis['strategy'] == strategy]['breed'].tolist()
        if breeds:
            print(f"{label}: {', '.join(breeds)}")

    return df_breed_analysis

# Run the per-breed analysis
breed_analysis = analyze_breeds_for_training(df_cid)
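
The per-breed analysis rests on two image-count thresholds: 1,000 images for direct training and 100 for transfer learning. For breeds labeled directly in the dataset, that rule can be isolated as a tiny function, sketched here with the thresholds as overridable parameters. `training_strategy` is a hypothetical name.

```python
def training_strategy(image_count: int, direct_min: int = 1000, transfer_min: int = 100) -> str:
    """Map an image count to the strategy labels used in the breed analysis."""
    if image_count >= direct_min:
        return "Direct training"
    if image_count >= transfer_min:
        return "Transfer learning"
    return "Data collection"


for n in (1500, 250, 40):
    print(n, "->", training_strategy(n))
# 1500 -> Direct training
# 250 -> Transfer learning
# 40 -> Data collection
```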


## 🔧 Days 5-6: Prepare the Data Pipeline


In [None]:
# ============================================================
# OPTIMIZED DATA PIPELINE
# ============================================================

class CattleDataPipeline:
    """Data pipeline for training weight estimation models."""

    def __init__(self, data_dir, breeds_mapping=None):
        self.data_dir = Path(data_dir)
        self.breeds_mapping = breeds_mapping or {}

        # Aggressive augmentation for small datasets
        self.augmentation = A.Compose([
            # Lighting variations
            A.RandomBrightnessContrast(brightness_limit=0.3, contrast_limit=0.3, p=0.6),
            A.HueSaturationValue(hue_shift_limit=15, sat_shift_limit=25, p=0.5),

            # Noise and blur. Albumentations 2.x replaced var_limit with std_range, given as
            # a fraction of the maximum pixel value; (0.009, 0.015) approximates var_limit=(5, 15).
            A.GaussNoise(std_range=(0.009, 0.015), p=0.3),
            A.Blur(blur_limit=3, p=0.25),

            # Atmospheric effects (2.x uses fog_coef_range instead of fog_coef_lower/upper)
            A.RandomShadow(shadow_roi=(0, 0.5, 1, 1), p=0.4),
            A.RandomFog(fog_coef_range=(0.1, 0.3), p=0.2),

            # Geometric transformations
            A.RandomRotate90(p=0.3),
            A.HorizontalFlip(p=0.5),
            A.ShiftScaleRotate(
                shift_limit=0.1, scale_limit=0.15,
                rotate_limit=15, border_mode=cv2.BORDER_REFLECT, p=0.5
            ),

            # Cattle-specific augmentation
            # RandomCrop needs source images of at least 200x200 pixels.
            A.RandomCrop(height=200, width=200, p=0.3),  # Simulate different distances
            A.ElasticTransform(alpha=1, sigma=50, p=0.2),  # Natural deformations
            A.GridDistortion(num_steps=5, distort_limit=0.3, p=0.2),
        ])

        print(f"✅ Pipeline initialized for: {self.data_dir}")
    
    def load_and_preprocess(self, img_path: Path, weight: float, augment: bool = False) -> tuple[np.ndarray, float]:
        """Load an image, optionally augment it, and return arrays ready for the model."""
        if not img_path.exists():
            raise FileNotFoundError(f"Image not found: {img_path}")

        img = cv2.imread(str(img_path))
        if img is None:
            raise ValueError(f"Could not load image: {img_path}")

        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

        # Augment training images only; validation and test images stay untouched
        # so that their metrics are comparable between runs.
        if augment:
            img = self.augmentation(image=img)['image']

        img = cv2.resize(img, CONFIG['image_size'])
        img = img.astype(np.float32) / 255.0

        return img, float(weight)

    def create_tf_dataset(self, df, split='train'):
        """Create a tf.data.Dataset from real image paths."""
        print(f"🔧 Creating TensorFlow dataset for split: {split}")

        required_columns = {'image_path', 'weight_kg'}
        missing_columns = required_columns.difference(df.columns)
        if missing_columns:
            raise ValueError(
                f"The DataFrame for split '{split}' is missing required columns: {sorted(missing_columns)}"
            )

        is_train = split == 'train'

        def data_generator():
            for _, row in df.iterrows():
                raw_path = Path(row['image_path'])
                img_path = raw_path if raw_path.is_absolute() else self.data_dir / raw_path

                img, weight = self.load_and_preprocess(img_path, row['weight_kg'], augment=is_train)
                yield img, weight

        dataset = tf.data.Dataset.from_generator(
            data_generator,
            output_signature=(
                tf.TensorSpec(shape=CONFIG['image_size'] + (3,), dtype=tf.float32),
                tf.TensorSpec(shape=(), dtype=tf.float32),
            ),
        )

        if is_train:
            # Do not cache the training split: caching after augmentation would replay
            # the first epoch's augmented images in every later epoch.
            dataset = dataset.shuffle(1000)
        else:
            dataset = dataset.cache()

        dataset = dataset.batch(CONFIG['batch_size'])
        dataset = dataset.prefetch(tf.data.AUTOTUNE)

        print(f"✅ Dataset {split} created with optimizations")
        return dataset
    
    def split_data(self, df):
        """Split the data into train/val/test sets."""
        print("📊 Splitting data into train/val/test...")

        # Shuffle rows with a fixed seed so the split is reproducible
        df_shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)

        # Compute split sizes; int() truncates, so the test set absorbs any remainder
        n_total = len(df_shuffled)
        if n_total == 0:
            raise ValueError("Cannot split an empty DataFrame")
        n_train = int(n_total * (1 - CONFIG['validation_split'] - CONFIG['test_split']))
        n_val = int(n_total * CONFIG['validation_split'])

        # Slice consecutive ranges of the shuffled frame
        df_train = df_shuffled[:n_train]
        df_val = df_shuffled[n_train:n_train + n_val]
        df_test = df_shuffled[n_train + n_val:]

        print(f"📈 Train: {len(df_train):,} ({len(df_train)/n_total*100:.1f}%)")
        print(f"📈 Val: {len(df_val):,} ({len(df_val)/n_total*100:.1f}%)")
        print(f"📈 Test: {len(df_test):,} ({len(df_test)/n_total*100:.1f}%)")
        
        return df_train, df_val, df_test

# Create the pipeline
pipeline = CattleDataPipeline(RAW_DIR)

# Split the data
df_train, df_val, df_test = pipeline.split_data(df_cid)

# Create the TensorFlow datasets
train_dataset = pipeline.create_tf_dataset(df_train, 'train')
val_dataset = pipeline.create_tf_dataset(df_val, 'val')
test_dataset = pipeline.create_tf_dataset(df_test, 'test')
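
The split sizes follow directly from the two fractions stored in `CONFIG`. As a rough illustration of the arithmetic in `split_data`, here is a plain-Python sketch. The helper name `split_counts` and the fractions used are assumptions made for the example, not the notebook's actual configuration.

```python
def split_counts(n_total, validation_split, test_split):
    """Return (n_train, n_val, n_test) using the same truncation as split_data."""
    # int() truncates toward zero, so train and val never exceed their share
    n_train = int(n_total * (1 - validation_split - test_split))
    n_val = int(n_total * validation_split)
    # Whatever is left over goes to the test set, so no row is dropped
    n_test = n_total - n_train - n_val
    return n_train, n_val, n_test


# Illustrative values only (not the notebook's CONFIG)
counts = split_counts(10, validation_split=0.25, test_split=0.25)
print(counts)  # (5, 2, 3)
```

Because of the truncation, the test set can end up slightly larger than its nominal fraction on small datasets.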


In [None]:
# ============================================================
# MODEL ARCHITECTURE
# ============================================================

def create_weight_estimation_model():
    """Create the weight-estimation model."""
    print("🏗️ Creating model architecture...")

    # Base model with transfer learning (ImageNet weights, no classification head).
    # Note: Keras EfficientNet models include their own input rescaling and expect
    # pixel values in the [0, 255] range.
    base_model = EfficientNetB0(
        weights='imagenet',
        include_top=False,
        input_shape=CONFIG['image_size'] + (3,)
    )

    # Freeze the entire backbone; only the custom head is trained at this stage
    base_model.trainable = False

    # Custom regression head
    x = base_model.output
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(256, activation='relu', name='dense_1')(x)
    x = layers.Dropout(0.3)(x)
    x = layers.Dense(128, activation='relu', name='dense_2')(x)
    x = layers.Dropout(0.2)(x)

    # Output: estimated weight in kg (float32 so the regression output keeps
    # full precision if a mixed-precision policy is active)
    output = layers.Dense(1, activation='linear', dtype='float32', name='weight_output')(x)

    # Build the model
    model = models.Model(inputs=base_model.input, outputs=output)

    # Compile the model
    model.compile(
        optimizer=optimizers.Adam(learning_rate=CONFIG['learning_rate']),
        loss='mse',
        metrics=['mae', 'mse']
    )

    print(f"✅ Model created with {model.count_params():,} parameters")
    print("📊 Architecture: EfficientNetB0 + Custom Head")
    
    return model

# Create the model
model = create_weight_estimation_model()

# Show the summary
model.summary()
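
With the backbone frozen, only the custom head is trained, and its size can be checked by hand. The sketch below counts the parameters of the three `Dense` layers, assuming the 1280-channel feature map that EfficientNetB0 emits without its top. The helper `dense_params` is a hypothetical name introduced for the example.

```python
def dense_params(n_inputs, n_units):
    """Parameters of a fully connected layer: one weight per input per unit, plus biases."""
    return n_inputs * n_units + n_units


# EfficientNetB0 without its top emits 1280 feature channels, which
# GlobalAveragePooling2D reduces to a 1280-dimensional vector (no parameters).
FEATURES = 1280

# (layer name, inputs, units); Dropout layers add no parameters
head_layers = [
    ("dense_1", FEATURES, 256),
    ("dense_2", 256, 128),
    ("weight_output", 128, 1),
]

head_total = 0
for name, n_in, n_out in head_layers:
    count = dense_params(n_in, n_out)
    head_total += count
    print(f"{name}: {count:,}")

print(f"trainable head parameters: {head_total:,}")  # trainable head parameters: 360,961
```

If the backbone is fully frozen, this figure should match the trainable-parameter count reported by `model.summary()`.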


In [None]:
# ============================================================
# TRAINING CONFIGURATION
# ============================================================

def setup_training_callbacks():
    """Configure the training callbacks."""
    print("⚙️ Configuring training callbacks...")

    callbacks_list = [
        # Early stopping: stop when val_loss stalls and restore the best weights
        callbacks.EarlyStopping(
            monitor='val_loss',
            patience=CONFIG['early_stopping_patience'],
            restore_best_weights=True,
            verbose=1
        ),

        # Halve the learning rate when val_loss plateaus
        callbacks.ReduceLROnPlateau(
            monitor='val_loss',
            factor=0.5,
            patience=5,
            min_lr=1e-7,
            verbose=1
        ),

        # Model checkpoint: keep only the best model seen so far
        callbacks.ModelCheckpoint(
            filepath=str(MODELS_DIR / 'best_model.h5'),
            monitor='val_loss',
            save_best_only=True,
            verbose=1
        ),

        # TensorBoard logging
        callbacks.TensorBoard(
            log_dir=str(BASE_DIR / 'logs'),
            histogram_freq=1,
            write_graph=True,
            write_images=True
        )
    ]

    print(f"✅ {len(callbacks_list)} callbacks configured")
    return callbacks_list

# Configure the callbacks
training_callbacks = setup_training_callbacks()

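# Configure MLflow
def start_mlflow_run():
    """Start an MLflow run and log the training configuration."""
    # Close any run left open by a previous execution of this cell
    if mlflow.active_run() is not None:
        mlflow.end_run()
    run = mlflow.start_run(run_name="cattle-weight-base-model")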

    mlflow.log_params({
        'dataset': 'CID',
        'model': 'EfficientNetB0',
        'batch_size': CONFIG['batch_size'],
        'learning_rate': CONFIG['learning_rate'],
        'epochs': CONFIG['epochs'],
        'image_size': CONFIG['image_size'],
        'augmentation': 'Albumentations'
    })

    print(f"🔬 MLflow run started: {run.info.run_id}")
    return run

# Start the MLflow run
mlflow_run = start_mlflow_run()
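
To see how the `ReduceLROnPlateau` settings above (`factor=0.5`, `patience=5`, `min_lr=1e-7`) play out, the following pure-Python sketch simulates a simplified version of the plateau rule. It deliberately ignores the callback's `min_delta` and `cooldown` options, and the starting learning rate and loss values are made up for illustration.

```python
def simulate_reduce_lr(val_losses, initial_lr, factor=0.5, patience=5, min_lr=1e-7):
    """Return the learning rate in effect after each epoch (simplified plateau rule)."""
    best = float("inf")
    wait = 0
    lr = initial_lr
    schedule = []
    for loss in val_losses:
        if loss < best:
            # Improvement: remember it and reset the patience counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # Plateau: scale the rate down, but never below min_lr
                lr = max(lr * factor, min_lr)
                wait = 0
        schedule.append(lr)
    return schedule


# Hypothetical validation losses: one improvement, then a five-epoch plateau
losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
schedule = simulate_reduce_lr(losses, initial_lr=1e-3)
print(schedule)
```

In this simplified model the rate is halved on the fifth consecutive epoch without improvement, so the reduction only has a chance to help if `CONFIG['early_stopping_patience']` is larger than 5.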


In [None]:
# ============================================================
# MODEL TRAINING
# ============================================================

def train_model():
    """Train the base model."""
    print("🚀 Starting base model training...")
    print(f"📊 Configuration: {CONFIG}")

    # Batches per epoch (informational only). The datasets are batched without
    # drop_remainder, so a final partial batch counts as a step: use ceiling division.
    steps_per_epoch = -(-len(df_train) // CONFIG['batch_size'])
    validation_steps = -(-len(df_val) // CONFIG['batch_size'])

    print(f"📈 Steps per epoch: {steps_per_epoch}")
    print(f"📈 Validation steps: {validation_steps}")

    # Train the model; Keras iterates each finite dataset until it is exhausted
    history = model.fit(
        train_dataset,
        epochs=CONFIG['epochs'],
        validation_data=val_dataset,
        callbacks=training_callbacks,
        verbose=1
    )

    print("✅ Training completed")
    return history

# Actual training run (requires prepared datasets and a GPU runtime)
history = train_model()
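
The step counts printed by `train_model` depend on whether the last, partial batch is kept. A small sketch of that arithmetic, with a hypothetical helper `batches_per_epoch` and illustrative numbers rather than the notebook's real split sizes:

```python
import math


def batches_per_epoch(n_samples, batch_size, drop_remainder=False):
    """Number of batches a finite dataset yields after .batch(batch_size)."""
    if drop_remainder:
        # Incomplete final batch is discarded
        return n_samples // batch_size
    # Incomplete final batch is kept as a smaller batch
    return math.ceil(n_samples / batch_size)


# Illustrative numbers, not the notebook's actual split size or CONFIG
print(batches_per_epoch(1000, 32))                       # 32
print(batches_per_epoch(1000, 32, drop_remainder=True))  # 31
```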


In [None]:
# ============================================================
# MODEL EVALUATION
# ============================================================

def evaluate_model():
    """Evaluate the model on the test set."""
    print("📊 Evaluating model on the test set...")

    # Evaluate the model
    test_loss, test_mae, test_mse = model.evaluate(test_dataset, verbose=0)

    # Compute the actual R² from predictions on the test set
    y_true = []
    y_pred = []
    for batch_images, batch_targets in test_dataset:
        predictions = model.predict(batch_images, verbose=0)
        y_true.extend(batch_targets.numpy().astype(float))
        # reshape(-1) keeps a 1-D array even when the last batch holds a single sample
        # (squeeze() would return a 0-d array, which extend() cannot iterate)
        y_pred.extend(predictions.reshape(-1).astype(float))

    test_r2 = r2_score(y_true, y_pred)

    print("📈 EVALUATION RESULTS:")
    print(f"   Loss: {test_loss:.2f}")
    print(f"   MAE: {test_mae:.2f} kg")
    print(f"   MSE: {test_mse:.2f}")
    print(f"   R²: {test_r2:.3f}")

    # Check the targets
    print("\n🎯 TARGET CHECK:")
    print(f"   R² ≥ {CONFIG['target_r2']}: {'✅' if test_r2 >= CONFIG['target_r2'] else '❌'} ({test_r2:.3f})")
    print(f"   MAE < {CONFIG['max_mae']} kg: {'✅' if test_mae < CONFIG['max_mae'] else '❌'} ({test_mae:.2f} kg)")

    # Log metrics to MLflow
    mlflow.log_metrics({
        'test_loss': test_loss,
        'test_mae': test_mae,
        'test_mse': test_mse,
        'test_r2': test_r2
    })
    
    return {
        'loss': test_loss,
        'mae': test_mae,
        'mse': test_mse,
        'r2': test_r2
    }

# Evaluate the model
evaluation_results = evaluate_model()
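
The two headline metrics can be reproduced by hand, which helps when sanity-checking the numbers reported above. This is a minimal sketch in plain Python with made-up weights; the notebook itself relies on Keras metrics and `r2_score`.

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the same unit as the targets (kg here)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up weights in kg, purely for illustration
y_true = [400.0, 450.0, 500.0]
y_pred = [410.0, 445.0, 490.0]

print(f"MAE: {mae(y_true, y_pred):.2f} kg")  # MAE: 8.33 kg
print(f"R²: {r2(y_true, y_pred):.3f}")       # R²: 0.955
```

In this toy case R² clears 0.95 while the MAE is well above 5 kg, which shows why the two targets need to be checked separately.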


In [None]:
# ============================================================
# EXPORT TO TFLITE
# ============================================================

def export_to_tflite(model, output_path):
    """Export the model to TFLite, optimized for mobile."""
    print(f"📱 Exporting model to TFLite: {output_path}")

    # Set up the converter
    converter = tf.lite.TFLiteConverter.from_keras_model(model)

    # Mobile optimizations: default optimizations with float16 weights
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.target_spec.supported_types = [tf.float16]  # FP16 weights: smaller file, GPU-delegate friendly

    # INT8 quantization (optional, more aggressive); needs a representative dataset generator
    # converter.representative_dataset = representative_data_gen
    # converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

    # Convert
    tflite_model = converter.convert()

    # Save
    with open(output_path, 'wb') as f:
        f.write(tflite_model)

    # Model information
    model_size_kb = len(tflite_model) / 1024
    model_size_mb = model_size_kb / 1024
    print("✅ Model exported successfully")
    print(f"📏 Size: {model_size_mb:.2f} MB ({model_size_kb:.1f} KB)")
    print("📱 Optimized for mobile: FP16")

    # Log to MLflow (str() because output_path may be a pathlib.Path)
    mlflow.log_artifact(str(output_path))
    mlflow.log_metric('model_size_kb', model_size_kb)
    mlflow.log_metric('model_size_mb', model_size_mb)

    return model_size_kb

# Export the base model
tflite_path = MODELS_DIR / 'generic-cattle-v1.0.0.tflite'
model_size = export_to_tflite(model, tflite_path)

print("\n🎯 BASE MODEL READY FOR INTEGRATION")
print(f"📁 File: {tflite_path}")
print(f"📏 Size: {model_size / 1024:.2f} MB ({model_size:.1f} KB)")
print(f"🔬 MLflow run: {mlflow_run.info.run_id}")
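
Float16 quantization stores each weight in 2 bytes instead of 4, so the exported file should be roughly half the size of a float32 export. The sketch below estimates weight storage only. The parameter count is a hypothetical placeholder (read the real one from `model.count_params()`), and the actual `.tflite` file also carries graph metadata, so the figures are approximate.

```python
def estimated_weights_mb(n_params, bytes_per_weight):
    """Approximate storage for the weights alone, in MB (1 MB = 1024 * 1024 bytes)."""
    return n_params * bytes_per_weight / (1024 * 1024)


# Hypothetical parameter count; read the real one from model.count_params()
n_params = 4_400_000

fp32_mb = estimated_weights_mb(n_params, 4)  # 4 bytes per float32 weight
fp16_mb = estimated_weights_mb(n_params, 2)  # 2 bytes per float16 weight
print(f"float32: ~{fp32_mb:.1f} MB, float16: ~{fp16_mb:.1f} MB")
```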


## 📋 Summary and Next Steps


In [None]:
# ============================================================
# FINAL SUMMARY
# ============================================================

def generate_final_summary():
    """Generate the final summary of the work completed."""
    print("📋 FINAL SUMMARY - PERSON 2: ML SETUP")
    print("=" * 60)

    # Dataset summary
    print("\n📥 DATASETS PROCESSED:")
    cid_row = datasets_summary[datasets_summary['name'] == 'CID Dataset']
    cid_images = int(cid_row['images'].iloc[0]) if not cid_row.empty else 0
    print(f"   {'✅' if cid_images else '⚠️'} CID Dataset: {cid_images:,} images")
    print(f"   {'✅' if scraped_images else '⚠️'} Google Images: {scraped_images:,} local images")

    if kaggle_dataset_path and kaggle_dataset_path.exists():
        kaggle_images = len(list(kaggle_dataset_path.glob('**/*.jpg')))
        status_icon = '✅' if kaggle_images else '⚠️'
        print(f"   {status_icon} Kaggle Dataset ({KAGGLE_DATASET_ID}): {kaggle_images:,} images")
    else:
        print("   ⚠️ Kaggle Dataset: setup pending (upload kaggle.json and run the corresponding cell)")
    
    # Analysis summary
    print("\n📊 ANALYSIS COMPLETED:")
    print("   ✅ Full EDA with visualizations")
    print("   ✅ Per-breed analysis to inform the training strategy")
    print("   ✅ Optimized data pipeline")

    # Model summary
    print("\n🤖 BASE MODEL:")
    print("   ✅ Architecture: EfficientNetB0 + Custom Head")
    print(f"   ✅ Parameters: {model.count_params():,}")
    print(f"   ✅ TFLite exported: {model_size / 1024:.2f} MB ({model_size:.1f} KB)")
    print(f"   ✅ MLflow tracking: {mlflow_run.info.run_id}")

    # Next steps
    print("\n🎯 NEXT STEPS:")
    print("   1. 🔄 Per-breed fine-tuning (Weeks 3-6)")
    print("   2. 📸 Criollo + Pardo Suizo image collection (Weeks 7-8)")
    print("   3. 🧪 Final training (Weeks 9-10)")
    print("   4. 📱 Mobile app integration")
    
    # Save the summary
    summary_data = {
        'completion_date': pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S'),
        'datasets_processed': len(datasets_summary),
        # int() converts the NumPy integer returned by .sum(), which json cannot serialize
        'total_images': int(datasets_summary['images'].sum()),
        'model_architecture': 'EfficientNetB0',
        'model_size_kb': model_size,
        'mlflow_run_id': mlflow_run.info.run_id,
        'status': 'COMPLETADO'
    }

    with open(DATA_DIR / 'final_summary.json', 'w', encoding='utf-8') as f:
        json.dump(summary_data, f, indent=2)

    mlflow.end_run()

    print(f"\n💾 Summary saved to: {DATA_DIR / 'final_summary.json'}")
    print("\n🎉 PERSON 2: ML SETUP COMPLETED SUCCESSFULLY")

# Generate the final summary
generate_final_summary()
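
Aggregates such as `datasets_summary['images'].sum()` return NumPy scalars, which the standard `json` module refuses to serialize. One way to guard a summary dictionary against this is a `default` hook that unwraps anything exposing `.item()`, as NumPy scalars do. The helper names `to_builtin` and `summary_to_json` are hypothetical and not part of the notebook.

```python
import json


def to_builtin(value):
    """Fallback for json.dumps: unwrap scalar wrappers that expose .item()."""
    item = getattr(value, "item", None)
    if callable(item):
        return item()
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


def summary_to_json(summary):
    """Serialize a summary dict, converting wrapped scalars to Python builtins."""
    return json.dumps(summary, indent=2, default=to_builtin)


# Plain values pass through untouched; wrapped scalars go through to_builtin
print(summary_to_json({"total_images": 1200, "status": "COMPLETADO"}))
```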


## 📝 Important Notes

### ⚠️ Required Configuration
1. **Kaggle API**: Upload `kaggle.json` to download the datasets
2. **CID Dataset**: Replace the placeholder URL with the real URL
3. **CattleEyeView**: Request access from the paper's authors
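
For the Kaggle step, the API client looks for credentials in `~/.kaggle/kaggle.json` by default. A minimal sketch of installing an uploaded file there follows. The helper name `install_kaggle_credentials` and the upload location are assumptions; adjust them to wherever the file actually lands in your runtime.

```python
import os
import shutil
from pathlib import Path


def install_kaggle_credentials(uploaded_file, config_dir=None):
    """Copy kaggle.json into the Kaggle config directory and restrict its permissions."""
    uploaded_file = Path(uploaded_file)
    if not uploaded_file.is_file():
        raise FileNotFoundError(f"kaggle.json not found at {uploaded_file}")

    # Default location read by the Kaggle API client
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    config_dir.mkdir(parents=True, exist_ok=True)

    target = config_dir / "kaggle.json"
    shutil.copy(uploaded_file, target)
    # Owner read/write only; the client warns when the key is readable by others
    os.chmod(target, 0o600)
    return target
```

In Colab this might be called as `install_kaggle_credentials('/content/kaggle.json')`, assuming the file was uploaded to the default working directory.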

### 🔧 Optimizations Implemented
- **Mixed Precision**: FP16 to speed up training
- **Data Pipeline**: Cache + prefetch + shuffle
- **Augmentation**: Albumentations, tailored to cattle images
- **TFLite Export**: Optimized for mobile (FP16 weights)

### 📊 Target Metrics
- **R² ≥ 0.95**: The model explains at least 95% of the variance in weight
- **MAE < 5 kg**: Mean absolute error
- **Inference < 3s**: Latency on a mobile device
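
The three targets above can be checked together once the measurements are available. The sketch below uses hypothetical measurements and a hypothetical helper `check_targets`; only the thresholds come from the list above.

```python
# Thresholds taken from the target metrics listed above
TARGETS = {"r2_min": 0.95, "mae_max_kg": 5.0, "latency_max_s": 3.0}


def check_targets(r2, mae_kg, latency_s, targets=TARGETS):
    """Return a pass/fail flag per target."""
    return {
        "r2": r2 >= targets["r2_min"],
        "mae": mae_kg < targets["mae_max_kg"],
        "latency": latency_s < targets["latency_max_s"],
    }


# Hypothetical measurements, not results from this project
report = check_targets(r2=0.96, mae_kg=6.2, latency_s=1.4)
print(report)  # {'r2': True, 'mae': False, 'latency': True}
print("all targets met:", all(report.values()))  # all targets met: False
```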

### 🎯 Current Status
- ✅ **ML infrastructure**: Completed
- ✅ **Data pipeline**: Optimized
- ✅ **Base model**: Ready for fine-tuning
- 🔄 **Next**: Breed-specific fine-tuning
