
# UTKFace: Age (Regression) with a Focus on Minors (0–18)

This notebook trains a **multitask model** (age regression + age-range classification)
focused on **minimizing errors on minors** (0–18), especially teenagers (13–18).
It includes:
- A `tf.data` pipeline that reads the UTKFace dataset from a local folder (file name format: `age_gender_race_*.jpg`).
- A **weighted loss**, `weighted_mae`, that gives more weight to errors on minors.
- An **auxiliary classification head** over 4 buckets: Child (0–12), Teen (13–18), Young Adult (19–30), Adult (31+).
- Per-subgroup metrics, including **MAE on minors**, and a conservative threshold for gating access.
- Callbacks for early stopping and for saving the best model.

> **Requirements:** TensorFlow 2.10+ (or a version compatible with your GPU), Pillow, scikit-learn.


In [None]:

# === Configuration ===
DATA_DIR = r"C:/path/to/UTKFace"  # <-- Change to the folder that contains the UTKFace images (.jpg)
IMG_SIZE = (224, 224)
BATCH_SIZE = 64
SEED = 42

# Loss weights for the two outputs
LOSS_W_AGE = 0.7
LOSS_W_BUCKET = 0.3

# Extra weight for errors on minors (age < 19, i.e. 0–18)
MINOR_AGE_CUTOFF = 19
WEIGHT_FOR_MINOR = 3.0
WEIGHT_FOR_ADULT = 1.0

# Training
EPOCHS = 60
LEARNING_RATE = 1e-4
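
To make the two loss weights concrete, here is a small sketch of how per-output losses are combined when `loss_weights` is passed to `compile`: the total is a weighted sum. The per-output loss values below are hypothetical numbers chosen only to show the arithmetic, not results from this notebook.

```python
# Sketch: how the two loss weights combine into one training objective.
LOSS_W_AGE = 0.7
LOSS_W_BUCKET = 0.3

def total_loss(age_loss: float, bucket_loss: float) -> float:
    """Weighted sum of the per-output losses, as Keras does with loss_weights."""
    return LOSS_W_AGE * age_loss + LOSS_W_BUCKET * bucket_loss

# Hypothetical per-output values, chosen only to show the arithmetic.
example = total_loss(age_loss=6.0, bucket_loss=1.2)
print(f"total = {example:.2f}")  # 0.7 * 6.0 + 0.3 * 1.2
```

Because the age loss is measured in years while cross-entropy is usually much smaller, the age term may dominate the total by more than the 0.7 / 0.3 ratio suggests.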


In [None]:

import os, re, math, random, glob, json
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import EfficientNetB0
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, classification_report
from PIL import Image

tf.keras.utils.set_random_seed(SEED)  # seed Python, NumPy and TensorFlow for reproducibility
print("TensorFlow:", tf.__version__)
print("GPU available:", tf.config.list_physical_devices('GPU'))


In [None]:

# === Load paths and extract ages from the file name ===
# Formato UTKFace: [age]_[gender]_[race]_[date&time].jpg
pat = re.compile(r"^(\d+)_([01])_([0-4])_.*\.jpg$", re.I)

paths = sorted(glob.glob(os.path.join(DATA_DIR, "*.jpg")))
ages = []
file_paths = []

for p in paths:
    fname = os.path.basename(p)
    m = pat.match(fname)
    if not m:
        continue
    age = int(m.group(1))
    # Drop impossible ages if any exist (as a safety check)
    if age < 0 or age > 116:
        continue
    ages.append(age)
    file_paths.append(p)

ages = np.array(ages, dtype=np.float32)
file_paths = np.array(file_paths)
print("Total valid images:", len(file_paths))
print("Age range:", ages.min(), "->", ages.max())
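
As a quick check of what the pattern accepts, the sketch below applies the same regular expression to a few file names. The names are hypothetical, written to exercise the pattern, and `parse_age` is a helper introduced here for illustration; it is not part of the notebook.

```python
import re

# Same pattern as the notebook: [age]_[gender]_[race]_[date&time].jpg
pat = re.compile(r"^(\d+)_([01])_([0-4])_.*\.jpg$", re.I)

def parse_age(fname: str):
    """Return the age encoded in a UTKFace-style file name, or None if it does not match."""
    m = pat.match(fname)
    return int(m.group(1)) if m else None

# Hypothetical file names, written to exercise the pattern.
samples = [
    "25_1_2_20170116174525125.jpg",           # well-formed
    "7_0_0_20170110215648859.jpg.chip.jpg",   # name with an extra suffix before .jpg
    "61_1_20170109150557335.jpg",             # race field missing
    "notes.jpg",                              # unrelated file
]
parsed = {s: parse_age(s) for s in samples}
for name, age in parsed.items():
    print(f"{name!r:45} -> {age}")
```

Files with a missing field are skipped silently, so it may be worth comparing the number of valid images against the number of `.jpg` files found.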


In [None]:

# === Age buckets ===
# 0: Child (0–12), 1: Teen (13–18), 2: Young Adult (19–30), 3: Adult (31+)
def age_to_bucket(age):
    a = int(age)
    if a <= 12: return 0
    if a <= 18: return 1
    if a <= 30: return 2
    return 3

buckets = np.array([age_to_bucket(a) for a in ages], dtype=np.int32)
class_names = ["child(0-12)", "teen(13-18)", "young(19-30)", "adult(31+)"]
unique, counts = np.unique(buckets, return_counts=True)
print("Bucket distribution:", dict(zip([class_names[u] for u in unique], counts)))
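
The same bucket boundaries can be expressed without a Python loop. The sketch below is one possible vectorized variant using `np.digitize`; `BUCKET_EDGES` and `ages_to_buckets` are names introduced here for illustration.

```python
import numpy as np

# Upper edges (inclusive) of the first three buckets; anything above 30 falls in bucket 3.
BUCKET_EDGES = np.array([12, 18, 30])

def ages_to_buckets(ages):
    """Vectorized bucket lookup: 0 child, 1 teen, 2 young adult, 3 adult."""
    # right=True makes each edge inclusive on the right: edges[i-1] < age <= edges[i]
    return np.digitize(ages, BUCKET_EDGES, right=True)

boundary_ages = np.array([0, 12, 13, 18, 19, 30, 31, 116])
boundary_buckets = ages_to_buckets(boundary_ages)
print(dict(zip(boundary_ages.tolist(), boundary_buckets.tolist())))
```

For integer ages this agrees with `age_to_bucket`. For fractional ages it would differ, because `age_to_bucket` truncates with `int()` first (12.5 becomes 12), whereas `np.digitize` compares the raw value.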


In [None]:

# === Stratified split by bucket ===
train_idx, test_idx = train_test_split(
    np.arange(len(file_paths)), test_size=0.15, random_state=SEED, stratify=buckets
)
train_idx, val_idx = train_test_split(
    train_idx, test_size=0.1765, random_state=SEED, stratify=buckets[train_idx]
)  # ~15% val, 70% train, 15% test

def take(idx):
    return file_paths[idx], ages[idx], buckets[idx]

train_paths, train_ages, train_buckets = take(train_idx)
val_paths, val_ages, val_buckets = take(val_idx)
test_paths, test_ages, test_buckets = take(test_idx)

print("Train:", len(train_paths), " Val:", len(val_paths), " Test:", len(test_paths))

# Check how teens are represented in each split
def bucket_stats(name, b):
    u,c = np.unique(b, return_counts=True)
    print(name, dict(zip(u, c)))

bucket_stats("Train buckets", train_buckets)
bucket_stats("Val buckets", val_buckets)
bucket_stats("Test buckets", test_buckets)
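
The value `0.1765` in the second split is not arbitrary. The short sketch below works through the arithmetic that turns two consecutive splits into roughly 70% / 15% / 15%; the constant names are introduced here for illustration.

```python
# Sketch: effective split fractions produced by two consecutive splits.
TEST_FRAC = 0.15           # first split: held-out test share of the full dataset
VAL_FRAC_OF_REST = 0.1765  # second split: validation share of what remains

rest = 1.0 - TEST_FRAC
val_frac = rest * VAL_FRAC_OF_REST
train_frac = rest - val_frac

print(f"train={train_frac:.4f}  val={val_frac:.4f}  test={TEST_FRAC:.4f}")

# The exact value that yields equal val and test shares is TEST_FRAC / (1 - TEST_FRAC).
exact = TEST_FRAC / rest
print(f"exact second-split fraction = {exact:.4f}")
```

If you change `test_size` in the first split, the second fraction needs to be recomputed the same way to keep validation and test the same size.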


In [None]:

# === tf.data pipeline ===
AUTOTUNE = tf.data.AUTOTUNE

def preprocess_image(path):
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, IMG_SIZE)
    # Keep pixels in [0, 255]: Keras EfficientNetB0 rescales its input internally,
    # so dividing by 255 here would feed it near-black images.
    return tf.cast(img, tf.float32)

def augment(img):
    # Mild augmentations that do not distort facial features
    img = tf.image.random_flip_left_right(img)
    img = tf.image.random_contrast(img, 0.9, 1.1)
    img = tf.image.random_brightness(img, 0.05 * 255.0)  # delta scaled to the [0, 255] range
    img = tf.clip_by_value(img, 0.0, 255.0)
    # Slight random crop: keep 90–100% of each side, then resize back to IMG_SIZE
    crop_frac = tf.random.uniform([], 0.9, 1.0)
    new_size = tf.cast(tf.constant(IMG_SIZE, tf.float32) * crop_frac, tf.int32)
    img = tf.image.random_crop(img, tf.concat([new_size, [3]], axis=0))
    img = tf.image.resize(img, IMG_SIZE)
    img.set_shape(IMG_SIZE + (3,))  # restore the static shape lost by the dynamic crop
    return img

def make_ds(paths, ages, buckets, train=False):
    ds_paths = tf.data.Dataset.from_tensor_slices(paths)
    ds_ages = tf.data.Dataset.from_tensor_slices(ages)
    ds_buckets = tf.data.Dataset.from_tensor_slices(buckets)
    ds = tf.data.Dataset.zip((ds_paths, ds_ages, ds_buckets))

    def _map(path, age, bucket):
        img = preprocess_image(path)
        if train:
            img = augment(img)
        # targets: age (float), buckets one-hot
        bucket_oh = tf.one_hot(bucket, 4, dtype=tf.float32)
        return img, {"age": tf.expand_dims(age, -1), "bucket": bucket_oh}

    if train:
        ds = ds.shuffle(8192, seed=SEED, reshuffle_each_iteration=True)
    ds = ds.map(_map, num_parallel_calls=AUTOTUNE).batch(BATCH_SIZE).prefetch(AUTOTUNE)
    return ds

train_ds = make_ds(train_paths, train_ages, train_buckets, train=True)
val_ds   = make_ds(val_paths,   val_ages,   val_buckets,   train=False)
test_ds  = make_ds(test_paths,  test_ages,  test_buckets,  train=False)


In [None]:

# === Weighted loss to prioritize minors ===
@tf.function
def weighted_mae(y_true, y_pred):
    # y_true shape: (batch, 1)
    age_true = y_true[:, 0]
    weights = tf.where(age_true < MINOR_AGE_CUTOFF, WEIGHT_FOR_MINOR, WEIGHT_FOR_ADULT)
    mae = tf.abs(y_true - y_pred)[:, 0]
    return tf.reduce_mean(weights * mae)
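
To see what the weighting does to a batch, here is a NumPy mirror of the loss applied to a toy batch. `weighted_mae_np` is a name introduced here for illustration, and the ages are invented so the arithmetic is easy to follow by hand.

```python
import numpy as np

MINOR_AGE_CUTOFF = 19
WEIGHT_FOR_MINOR = 3.0
WEIGHT_FOR_ADULT = 1.0

def weighted_mae_np(age_true, age_pred):
    """NumPy mirror of the notebook's weighted_mae, for 1-D arrays of ages."""
    age_true = np.asarray(age_true, dtype=np.float32)
    age_pred = np.asarray(age_pred, dtype=np.float32)
    weights = np.where(age_true < MINOR_AGE_CUTOFF, WEIGHT_FOR_MINOR, WEIGHT_FOR_ADULT)
    return float(np.mean(weights * np.abs(age_true - age_pred)))

# Toy batch: two minors and two adults, each prediction off by exactly 2 years.
y_true = [10.0, 17.0, 25.0, 40.0]
y_pred = [12.0, 15.0, 27.0, 38.0]
plain = float(np.mean(np.abs(np.subtract(y_true, y_pred))))
weighted = weighted_mae_np(y_true, y_pred)
print(plain, weighted)
```

The weighted value is divided by the batch size rather than by the sum of the weights, so it is no longer an error in years and should not be compared directly with the `mae` metric.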


In [None]:

# === Multitask model ===
base = EfficientNetB0(include_top=False, input_shape=IMG_SIZE+(3,), weights='imagenet', pooling='avg')
x = layers.Dropout(0.3)(base.output)

age_output = layers.Dense(1, activation='linear', name='age')(x)
bucket_output = layers.Dense(4, activation='softmax', name='bucket')(x)

model = models.Model(inputs=base.input, outputs=[age_output, bucket_output])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
    loss={'age': weighted_mae, 'bucket': 'categorical_crossentropy'},
    loss_weights={'age': LOSS_W_AGE, 'bucket': LOSS_W_BUCKET},
    metrics={
        'age': ['mae'],
        # Precision/Recall use a 0.5 threshold and pool all 4 classes (micro-average), not only teens
        'bucket': ['accuracy', tf.keras.metrics.Precision(name='precision'), tf.keras.metrics.Recall(name='recall')]
    }
)
model.summary()


In [None]:

# === Callbacks ===
ckpt_path = "best_utkface_minor_focus.keras"
callbacks = [
    tf.keras.callbacks.ModelCheckpoint(ckpt_path, monitor='val_bucket_recall', mode='max', save_best_only=True, verbose=1),
    tf.keras.callbacks.EarlyStopping(monitor='val_bucket_recall', mode='max', patience=8, restore_best_weights=True, verbose=1),
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, verbose=1)
]


In [None]:

# === Training ===
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=EPOCHS,
    callbacks=callbacks
)


In [None]:

# === Detailed evaluation ===
pred_age, pred_bucket_prob = model.predict(test_ds, verbose=1)
y_true_age = []
y_true_bucket = []

for _, y in test_ds:
    y_true_age.append(y['age'].numpy())
    y_true_bucket.append(y['bucket'].numpy())
y_true_age = np.concatenate(y_true_age, axis=0)[:,0]
y_true_bucket = np.argmax(np.concatenate(y_true_bucket, axis=0), axis=1)

pred_age = pred_age[:,0]
pred_bucket = np.argmax(pred_bucket_prob, axis=1)

mae_global = mean_absolute_error(y_true_age, pred_age)
print("Overall MAE:", round(mae_global, 3))

def subgroup_mae(mask, name):
    if mask.sum() == 0:
        print(f"{name}: no data")
        return
    print(f"{name} MAE:", round(mean_absolute_error(y_true_age[mask], pred_age[mask]), 3))

mask_child = (y_true_age <= 12)
mask_teen  = (y_true_age >= 13) & (y_true_age <= 18)
mask_young = (y_true_age >= 19) & (y_true_age <= 30)
mask_adult = (y_true_age >= 31)

subgroup_mae(mask_child, "Children (0–12)")
subgroup_mae(mask_teen,  "Teens (13–18)")
subgroup_mae(mask_young, "Young adults (19–30)")
subgroup_mae(mask_adult, "Adults (31+)")

print("\nClassification report by bucket (predicted vs. true):")
print(classification_report(y_true_bucket, pred_bucket, target_names=class_names, digits=3))
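
For the access use case, the four-class report can be collapsed into a single question: how many true minors are predicted as minors at all? The sketch below shows one way to compute that. `minor_recall` is a name introduced here, and the label arrays are toy values for illustration only, not model results.

```python
import numpy as np

def minor_recall(true_bucket, pred_bucket):
    """Share of true minors (buckets 0 and 1) that are also predicted as minors."""
    true_minor = np.asarray(true_bucket) <= 1
    pred_minor = np.asarray(pred_bucket) <= 1
    if true_minor.sum() == 0:
        return float("nan")
    return float((true_minor & pred_minor).sum() / true_minor.sum())

# Toy labels for illustration only (not results from the model).
toy_true = np.array([0, 0, 1, 1, 1, 2, 2, 3])
toy_pred = np.array([0, 1, 1, 2, 1, 2, 1, 3])
toy_recall = minor_recall(toy_true, toy_pred)
print(f"minor recall = {toy_recall:.2f}")
```

In the notebook this could be called as `minor_recall(y_true_bucket, pred_bucket)`. A child predicted as a teen counts as correct here, since both outcomes lead to the same access decision.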


In [None]:

# === Conservative access policy ===
# Rule: if the model predicts "teen" with probability >= teen_threshold and age_pred < 21,
# force final age = min(age_pred, 18). This reduces false negatives (minors who get through).
teen_threshold = 0.35  # tune between 0.3 and 0.6 depending on recall
teen_index = 1

def final_age_and_access(age_pred, bucket_prob, teen_threshold=teen_threshold):
    teen_prob = float(bucket_prob[teen_index])
    final_age = float(age_pred)
    if teen_prob >= teen_threshold and final_age < 21:
        final_age = min(final_age, 18.0)
    # Access policy: allow only if final_age >= 18.5 (safety margin)
    can_access = final_age >= 18.5
    return final_age, can_access

# Example using the first test batch
for imgs, y in test_ds.take(1):
    pa, pb = model.predict(imgs, verbose=0)
    for i in range(min(5, len(pa))):
        fa, ok = final_age_and_access(pa[i][0], pb[i])
        print(f"Age_pred={pa[i][0]:.2f} -> Age_final={fa:.2f} | Access={'ALLOWED' if ok else 'DENIED'}")
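
Choosing `teen_threshold` is a trade-off between minors admitted and adults denied. The sketch below re-implements the rule in plain Python and sweeps a few thresholds over toy cases. The cases are invented for illustration only, and this variant takes the teen probability directly instead of the full probability vector.

```python
MINOR_AGE_CUTOFF = 19  # same definition of "minor" as the notebook (0–18)

def final_age_and_access(age_pred, teen_prob, teen_threshold, access_age=18.5):
    """Same rule as the notebook, with the teen probability passed in directly."""
    final_age = float(age_pred)
    if teen_prob >= teen_threshold and final_age < 21:
        final_age = min(final_age, 18.0)
    return final_age, final_age >= access_age

# Toy cases (true_age, predicted_age, teen_prob), invented for illustration only.
cases = [
    (16, 19.4, 0.45),  # minor over-estimated, fairly high teen probability
    (17, 20.2, 0.32),  # minor over-estimated, borderline teen probability
    (22, 20.5, 0.40),  # young adult who looks young
    (35, 33.0, 0.02),  # clear adult
]

def sweep(thresholds):
    """For each threshold, count minors admitted and adults denied on the toy cases."""
    rows = []
    for t in thresholds:
        admitted = [final_age_and_access(pred, p, t)[1] for _, pred, p in cases]
        minors_admitted = sum(ok for (age, _, _), ok in zip(cases, admitted) if age < MINOR_AGE_CUTOFF)
        adults_denied = sum(not ok for (age, _, _), ok in zip(cases, admitted) if age >= MINOR_AGE_CUTOFF)
        rows.append((t, minors_admitted, adults_denied))
    return rows

results = sweep([0.3, 0.35, 0.5])
for t, leaked, denied in results:
    print(f"teen_threshold={t:.2f}  minors admitted={leaked}  adults denied={denied}")
```

On real data, the same sweep could be run over the test-set predictions (`pred_age`, `pred_bucket_prob[:, teen_index]`, `y_true_age`) to pick the threshold from measured rates rather than by guesswork.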


In [None]:

# === Save history and notes ===
with open("training_history_minor_focus.json", "w", encoding="utf-8") as f:
    json.dump({k: [float(x) for x in v] for k,v in history.history.items()}, f, ensure_ascii=False, indent=2)

print("Saved: best model ->", "best_utkface_minor_focus.keras")
print("Saved: history -> training_history_minor_focus.json")
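
Once the history is on disk, it can be inspected without retraining. The sketch below finds the best epoch for a given metric in a history-style dictionary. `best_epoch` is a helper introduced here, and the toy values are invented; only the key names follow the metrics monitored by the callbacks.

```python
import json

def best_epoch(history: dict, key: str = "val_bucket_recall", mode: str = "max"):
    """Return (1-based epoch, value) of the best entry for `key` in a Keras-style history dict."""
    values = history[key]
    pick = max if mode == "max" else min
    idx = pick(range(len(values)), key=values.__getitem__)
    return idx + 1, values[idx]

# Toy history with the same layout as the saved JSON (values are invented).
toy_history = json.loads(
    '{"val_bucket_recall": [0.52, 0.61, 0.66, 0.64], "val_loss": [5.1, 4.4, 4.0, 4.2]}'
)
print(best_epoch(toy_history))
print(best_epoch(toy_history, key="val_loss", mode="min"))
```

With the real file, replace the toy dictionary with `json.load(open("training_history_minor_focus.json", encoding="utf-8"))`.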



## Tips and Adjustments
- **If the model lets minors through** (false negatives):
  - Raise `WEIGHT_FOR_MINOR` to 4.0 or 5.0.
  - Lower `teen_threshold` to 0.3 and/or raise the access threshold to `>= 19.0`.
  - Raise `LOSS_W_BUCKET` to 0.4 to strengthen the classification head.
- **If adult MAE rises noticeably** while minors stay accurate, that is an acceptable trade-off for this use case (safety).
- **Batch size**: if you run out of VRAM, lower it to 32.
- **Backbone**: you can switch to `ResNet50` if EfficientNet does not perform as well on your GPU. Unlike EfficientNet, Keras `ResNet50` has no built-in rescaling, so apply `tf.keras.applications.resnet50.preprocess_input` in the pipeline.
- **Data leakage**: make sure the splits do not overlap if you bring in external subsets.
- **Realistic evaluation**: test with your own images (with and without a beard, with and without makeup) to confirm the conservative behavior.
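
For the data-leakage tip, a minimal sanity check is to look for file names that appear in more than one split. The sketch below assumes that identical base names mean identical images, which may not hold for every external subset. `overlapping_files` and the example paths are hypothetical.

```python
import os

def overlapping_files(*splits):
    """Return file names (basename only) that appear in more than one split."""
    seen, dupes = {}, set()
    for i, paths in enumerate(splits):
        for name in {os.path.basename(p) for p in paths}:
            if name in seen and seen[name] != i:
                dupes.add(name)
            seen.setdefault(name, i)
    return sorted(dupes)

# Hypothetical paths; the "extra" folder stands in for an external subset.
train = ["UTKFace/25_1_2_a.jpg", "UTKFace/7_0_0_b.jpg"]
val = ["UTKFace/31_0_1_c.jpg"]
test = ["extra/7_0_0_b.jpg", "UTKFace/16_1_3_d.jpg"]
leaks = overlapping_files(train, val, test)
print(leaks)
```

This only catches exact name matches. Renamed copies or several photos of the same person would need a content-based check.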
