<a href="https://colab.research.google.com/github/LuisIZ/Labs_DeepLearning/blob/test/notebooks/lab1_animalSounds.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 1 - Animal Sounds

## Libraries

In [1]:
# For neural networks
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, TensorDataset

# For visualizing results
import matplotlib.pyplot as plt

# For audio processing
import torchaudio
import librosa
import librosa.display

# For metrics
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# For data manipulation
import numpy as np
import pandas as pd

# Other
import os
from pathlib import Path
import random
import math
from tqdm import tqdm
import csv
from google.colab import drive

In [2]:
# Library for decoding audio into PyTorch tensors
!pip install torchcodec

Collecting torchcodec
  Downloading torchcodec-0.10.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (11 kB)
Downloading torchcodec-0.10.0-cp312-cp312-manylinux_2_28_x86_64.whl (2.1 MB)
Installing collected packages: torchcodec
Successfully installed torchcodec-0.10.0


In [23]:
!pip -q install soundfile

## GPU

In [3]:
!nvidia-smi

Thu Jan 22 21:24:33 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   43C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## Seed

In [4]:
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

## Overview



1. What is the goal of the lab?

We want to build a multi-label classification model that correctly identifies the animals present in recordings from the Amazon rainforest.

2. What kind of data does the dataset contain?

The audio files are in WAV format in the `train/` and `test/` folders. The training data also comes with a CSV file whose first column, `filename`, holds the name of each audio file; the remaining columns are named after the species. There are 43 columns in total: the first contains strings, and the other 42 contain 0 or 1 to mark the absence or presence of each species in the recording.

3. What tools do we plan to use?

Initially we plan to use `PyTorch` for the neural network models; `Matplotlib` to plot our results (e.g. to visualize gradient descent or how the losses evolve during training and testing); `TorchAudio` or `Librosa` to analyze features of the dataset (e.g. Mel spectrogram, MFCC); `Sklearn` for metrics (e.g. F1 score, multilabel confusion matrix, ROC/PR curves) and for dimensionality reduction (e.g. PCA, t-SNE); and `Pandas` and `NumPy` to manipulate the data and compute summary statistics (e.g. mean, quartiles).

4. What constraints do we have?

Besides the one-week deadline, our compute resources are limited. We will work in Colab to take advantage of the GPU it provides.

## Google Drive


We verify access to Google Drive because we are developing the lab both in VS Code with the Google Colab extension and in the Colab web interface. In both cases we need to reach the data in Google Drive, where it appears as a shortcut (*symlink*) to a shared folder.

In [5]:
drive.mount("/content/drive", force_remount=True)

Mounted at /content/drive


In [6]:
MYDRIVE = Path("/content/drive/MyDrive")

hits = list(MYDRIVE.rglob("Animal Sounds"))
if not hits:
    raise FileNotFoundError("Could not find the 'Animal Sounds' folder inside MyDrive. Check that the shortcut exists.")

DATA_ROOT = hits[0]          # the shortcut path
REAL_ROOT = hits[0].resolve() # the real path (the shortcut's target)

print("DATA_ROOT:", DATA_ROOT)
print("REAL_ROOT:", REAL_ROOT)

DATA_ROOT: /content/drive/MyDrive/Animal Sounds
REAL_ROOT: /content/drive/.shortcut-targets-by-id/1F5_zs2zy0oECJu6NSwXtSAsIL9HQJ3yf/Animal Sounds


In [7]:
TRAIN_7Z  = REAL_ROOT / "train.7z"
TEST_7Z   = REAL_ROOT / "test.7z"
TRAIN_CSV = REAL_ROOT / "train.csv"

for p in [TRAIN_7Z, TEST_7Z, TRAIN_CSV]:
    print(p, "=>", p.exists())

/content/drive/.shortcut-targets-by-id/1F5_zs2zy0oECJu6NSwXtSAsIL9HQJ3yf/Animal Sounds/train.7z => True
/content/drive/.shortcut-targets-by-id/1F5_zs2zy0oECJu6NSwXtSAsIL9HQJ3yf/Animal Sounds/test.7z => True
/content/drive/.shortcut-targets-by-id/1F5_zs2zy0oECJu6NSwXtSAsIL9HQJ3yf/Animal Sounds/train.csv => True


Having located the paths to our files:

- The compressed archive with the training data (`train.7z`)
- The compressed archive with the test data (`test.7z`)
- The CSV file with the multi-label annotations for each audio clip, marking which species can be heard (1) and which cannot (0) (`train.csv`)

In the next section we decompress the archives with the `7z` tool and inspect their contents so we can carry out the Exploratory Data Analysis (*EDA*).

## Dataset

Working with the data directly on Drive would be very slow, so we extract the archives into `/content`, the Colab machine's temporary local disk, which offers faster I/O. Because it is temporary, the extraction has to be repeated whenever the runtime is reset.
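Since the extraction has to be repeated after every runtime reset but is wasted work when the files are already there, a small guard can help. The sketch below is only an illustration: `needs_extraction` is a hypothetical helper that is not part of the notebook, and it assumes an extraction counts as "done" as soon as any `.wav` file exists under the output folder.

```python
from pathlib import Path
import tempfile


def needs_extraction(out_dir: Path, pattern: str = "*.wav") -> bool:
    """Return True when out_dir holds no file matching pattern (searched recursively)."""
    if not out_dir.exists():
        return True
    # next(..., None) stops at the first match instead of listing every file
    return next(out_dir.rglob(pattern), None) is None


# Self-contained demo on a throwaway directory (not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "train"
    empty_before = needs_extraction(demo_dir)   # directory does not exist yet
    (demo_dir / "train").mkdir(parents=True)
    (demo_dir / "train" / "clip.wav").touch()
    empty_after = needs_extraction(demo_dir)    # a nested .wav is now present
```

In the notebook, the `7z x` commands could then be run only when `needs_extraction(TRAIN_OUT)` or `needs_extraction(TEST_OUT)` is true. A partially extracted archive would fool this check, so it is a convenience, not a guarantee.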

In [8]:
# Check that there is enough space in /content
!df -h /content

Filesystem      Size  Used Avail Use% Mounted on
overlay         113G   39G   74G  35% /


In [9]:
# Define paths and create folders to store the data
DRIVE_DATA = Path("/content/drive/.shortcut-targets-by-id/1F5_zs2zy0oECJu6NSwXtSAsIL9HQJ3yf/Animal Sounds")
TRAIN_7Z = DRIVE_DATA / "train.7z"
TEST_7Z  = DRIVE_DATA / "test.7z"
TRAIN_CSV = DRIVE_DATA / "train.csv"

OUT_BASE = Path("/content/data")
TRAIN_OUT = OUT_BASE / "train"
TEST_OUT  = OUT_BASE / "test"

TRAIN_OUT.mkdir(parents=True, exist_ok=True)
TEST_OUT.mkdir(parents=True, exist_ok=True)

print(TRAIN_7Z.exists(), TEST_7Z.exists(), TRAIN_CSV.exists())
print("Extracting to:", OUT_BASE)

True True True
Extracting to: /content/data


In [10]:
# Extract the data
!7z x "/content/drive/.shortcut-targets-by-id/1F5_zs2zy0oECJu6NSwXtSAsIL9HQJ3yf/Animal Sounds/train.7z" -o"/content/data/train" -y
!7z x "/content/drive/.shortcut-targets-by-id/1F5_zs2zy0oECJu6NSwXtSAsIL9HQJ3yf/Animal Sounds/test.7z"  -o"/content/data/test"  -y


7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,2 CPUs Intel(R) Xeon(R) CPU @ 2.00GHz (50653),ASM,AES-NI)

Scanning the drive for archives:
  0M Scan /content/drive/.shortcut-targe . wXtSAsIL9HQJ3yf/Animal Sounds/                                                                         1 file, 7069498436 bytes (6742 MiB)

Extracting archive: /content/drive/.shortcut-targets-by-id/1F5_zs2zy0oECJu6NSwXtSAsIL9HQJ3yf/Animal Sounds/train.7z
--
Path = /content/drive/.shortcut-targets-by-id/1F5_zs2zy0oECJu6NSwXtSAsIL9HQJ3yf/Animal Sounds/train.7z
Type = 7z
Physical Size = 7069498436
Headers Size = 727932
Method = LZMA2:23
Solid = +
Blocks = 1

  0%      0% - train/INCT17_20191113_040000_0_3.wav                 

In [11]:
# Check that the audio files are there
!find /content/data/train -maxdepth 2 -type f | head
!find /content/data/test  -maxdepth 2 -type f | head

/content/data/train/train/INCT17_20191113_040000_0_3.wav
/content/data/train/train/INCT17_20191113_040000_10_13.wav
/content/data/train/train/INCT17_20191113_040000_11_14.wav
/content/data/train/train/INCT17_20191113_040000_12_15.wav
/content/data/train/train/INCT17_20191113_040000_13_16.wav
/content/data/train/train/INCT17_20191113_040000_14_17.wav
/content/data/train/train/INCT17_20191113_040000_15_18.wav
/content/data/train/train/INCT17_20191113_040000_16_19.wav
/content/data/train/train/INCT17_20191113_040000_17_20.wav
/content/data/train/train/INCT17_20191113_040000_18_21.wav
/content/data/test/test/INCT17_20191125_040000_0_3.wav
/content/data/test/test/INCT17_20191125_040000_10_13.wav
/content/data/test/test/INCT17_20191125_040000_11_14.wav
/content/data/test/test/INCT17_20191125_040000_12_15.wav
/content/data/test/test/INCT17_20191125_040000_13_16.wav
/content/data/test/test/INCT17_20191125_040000_14_17.wav
/content/data/test/test/INCT17_20191125_040000_15_18.wav
/content/data/t

In [12]:
DATA_DIR = Path("/content/data")

def resolve_nested(split_dir: Path):
    nested = split_dir / split_dir.name
    return nested if nested.exists() else split_dir

TRAIN_DIR = resolve_nested(DATA_DIR / "train")
TEST_DIR  = resolve_nested(DATA_DIR / "test")

print("TRAIN_DIR:", TRAIN_DIR)
print("TEST_DIR :", TEST_DIR)
print("Train dir exists:", TRAIN_DIR.exists())
print("Test dir exists :", TEST_DIR.exists())

TRAIN_DIR: /content/data/train/train
TEST_DIR : /content/data/test/test
Train dir exists: True
Test dir exists : True


In [13]:
df = pd.read_csv(TRAIN_CSV)
print(df.shape)
df.head()

(62191, 43)


Unnamed: 0,filename,SPHSUR,BOABIS,SCIPER,DENNAH,LEPLAT,RHIICT,BOALEP,BOAFAB,PHYCUV,...,SCINAS,LEPNOT,ADEMAR,BOAALM,PHYDIS,RHIORN,LEPFLA,SCIRIZ,DENELE,SCIALT
0,INCT20955_20190909_050000_0_3.wav,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,INCT20955_20190909_050000_1_4.wav,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,INCT20955_20190909_050000_2_5.wav,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,INCT20955_20190909_050000_3_6.wav,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,INCT20955_20190909_050000_4_7.wav,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
f0 = df["filename"].iloc[0]
print("File:", f0)
print("Exists?:", (TRAIN_DIR / f0).exists())

File: INCT20955_20190909_050000_0_3.wav
Exists?: True


In [15]:
drive_root = Path("/content/drive")

train7z = list(drive_root.rglob("train.7z"))
test7z  = list(drive_root.rglob("test.7z"))
csvs    = list(drive_root.rglob("train.csv"))

print("train.7z found:", len(train7z))
print("test.7z found:", len(test7z))
print("train.csv found:", len(csvs))

# candidate: a folder that contains all 3 files
candidates = []
for p in train7z:
    folder = p.parent
    if (folder / "test.7z").exists() and (folder / "train.csv").exists():
        candidates.append(folder)

print("Candidate folders:", len(candidates))
for c in candidates[:10]:
    print(" -", c)

# Fail with a clear message instead of an IndexError when nothing was found
if not candidates:
    raise FileNotFoundError("No folder under /content/drive contains train.7z, test.7z and train.csv.")

DATASET_DIR = candidates[0]  # if there is more than one, pick the right one manually
print("Using DATASET_DIR:", DATASET_DIR)

train.7z found: 1
test.7z found: 1
train.csv found: 1
Candidate folders: 1
 - /content/drive/.shortcut-targets-by-id/1F5_zs2zy0oECJu6NSwXtSAsIL9HQJ3yf/Animal Sounds
Using DATASET_DIR: /content/drive/.shortcut-targets-by-id/1F5_zs2zy0oECJu6NSwXtSAsIL9HQJ3yf/Animal Sounds


## EDA

### Labels

In [None]:
label_cols = df.columns[1:]   # every species column (all columns except `filename`)
Y = df[label_cols].astype(int)

k = Y.sum(axis=1)  # number of labels per audio clip
print("Shape:", df.shape)
print("Num audio clips:", len(df))
print("Num classes:", len(label_cols))
print("Unique audio clips:", df["filename"].nunique())
print("Total nulls:", df.isna().sum().sum())
print("Percentage of clips without labels:", (k==0).mean()*100)

In [None]:
plt.figure()
plt.hist(k, bins=np.arange(k.max()+2)-0.5)
plt.xlabel("# labels per audio clip")
plt.ylabel("count")
plt.title("Multi-label cardinality distribution")
plt.show()

In [None]:
# how many labels per audio clip?
k = df[label_cols].sum(axis=1)
k.describe()

In [None]:
# class balance
class_counts = Y.sum().sort_values(ascending=False)
print("Top 10 classes:\n", class_counts.head(10))
print("\nBottom 10 classes:\n", class_counts.tail(10))

In [None]:
plt.figure()
plt.plot(class_counts.values)
plt.yscale("log")
plt.xlabel("class (sorted by frequency)")
plt.ylabel("count (log)")
plt.title("Class imbalance (log scale)")
plt.show()

In [None]:
# percentage of clips in which each class is present
class_pct = (class_counts / len(df) * 100).sort_values(ascending=False)
class_pct.head(10)

In [None]:
# classes with 0 or very few occurrences
zero_classes = class_counts[class_counts==0].index.tolist()
rare_classes = class_counts[class_counts<100].index.tolist()
print("Classes with 0 in train:", zero_classes)
print("Classes with <100 in train:", len(rare_classes))

### Co-occurrences

In [None]:
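# co-occurrence matrix (42x42)
# Built in NumPy first: writing into `DataFrame.values` in place is not
# guaranteed to modify the DataFrame (e.g. under pandas copy-on-write).
cooc_arr = Y.values.T @ Y.values
np.fill_diagonal(cooc_arr, 0)
cooc = pd.DataFrame(cooc_arr, index=label_cols, columns=label_cols)

# top pairs (upper triangle only, so each pair is counted once)
pairs = []
for i, a in enumerate(label_cols):
    for j, b in enumerate(label_cols):
        if j <= i:
            continue
        c = cooc.loc[a, b]
        if c > 0:
            pairs.append((c, a, b))
pairs.sort(reverse=True)

print("Top 15 most common pairs:")
for c, a, b in pairs[:15]:
    print(f"{a} + {b}: {c}")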
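
The matrix product above counts, for every pair of species, how many clips contain both. As a cross-check, the same counts can be obtained by walking the rows and counting label pairs directly. The sketch below uses a made-up toy table (`toy_rows`, `toy_labels`), and `top_label_pairs` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter
from itertools import combinations


def top_label_pairs(rows, label_names, top_k=3):
    """Count how often each unordered pair of labels is active in the same row."""
    counts = Counter()
    for row in rows:
        present = [name for name, flag in zip(label_names, row) if flag]
        # combinations() yields each unordered pair exactly once per row
        counts.update(combinations(present, 2))
    return counts.most_common(top_k)


# Toy multi-label table: 4 clips, 3 hypothetical species
toy_labels = ["A", "B", "C"]
toy_rows = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
    [1, 0, 0],
]
toy_pairs = top_label_pairs(toy_rows, toy_labels)
```

Because most clips carry only a few labels, this row-wise count touches far fewer pairs than the full 42x42 loop, although for a dataset of this size either approach should be fast enough.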

### Audio

In [None]:
def audio_info_fast(path):
    waveform, sample_rate = torchaudio.load(str(path))
    num_channels = waveform.shape[0]
    num_frames = waveform.shape[1]
    duration = num_frames / sample_rate
    return sample_rate, num_frames, duration, num_channels

In [None]:
sample_n = 1000
sample_files = df["filename"].sample(sample_n, random_state=SEED).tolist()

In [None]:
srs, durs, chs = [], [], []
for fn in tqdm(sample_files):
    p = TRAIN_DIR / fn
    sr, nframes, dur, nch = audio_info_fast(p)
    srs.append(sr); durs.append(dur); chs.append(nch)

In [None]:
print("Sample rates (first 10 unique values):", sorted(set(srs))[:10], "...")
print("Channels (unique values):", sorted(set(chs)))
print("Mean duration:", np.mean(durs), "sec")
print("Min/max duration:", np.min(durs), np.max(durs))

In [None]:
plt.figure()
plt.hist(durs, bins=30)
plt.xlabel("duration (s)")
plt.ylabel("count")
plt.title("Duration distribution (sample)")
plt.show()

In [None]:
def plot_wave_and_melspec(wav_path, target_sr=22050):
    y, sr = librosa.load(wav_path, sr=target_sr, mono=True)
    plt.figure()
    plt.plot(y)
    plt.title(f"Waveform | sr={sr} | len={len(y)/sr:.2f}s")
    plt.show()

    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, n_fft=1024, hop_length=512)
    S_db = librosa.power_to_db(S, ref=np.max)
    plt.figure()
    librosa.display.specshow(S_db, sr=sr, hop_length=512, x_axis='time', y_axis='mel')
    plt.colorbar(format="%+2.0f dB")
    plt.title("Log-Mel Spectrogram")
    plt.show()

In [None]:
# example: the first file listed in the CSV (not a random one)
fn = df["filename"].iloc[0]
plot_wave_and_melspec(str(TRAIN_DIR/fn))

## Feature Extraction

### For Model 1: MFCC(mean/std) + MLP

In [None]:
# create a folder to store the features
CACHE_DIR = Path("/content/drive/MyDrive/Lab1_cache")
CACHE_DIR.mkdir(parents=True, exist_ok=True)
print(CACHE_DIR)

In [None]:
def extract_mfcc_stats(wav_path, target_sr=22050, n_mfcc=20):
    # Load without forcing sr so the original sample rate is preserved
    y, sr = librosa.load(wav_path, sr=None, mono=True)

    # Resample ONLY if needed (according to the EDA, it should always be 22050)
    if sr != target_sr:
        y = librosa.resample(y, orig_sr=sr, target_sr=target_sr)
        sr = target_sr

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    # stats (baseline)
    feat = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)]).astype(np.float32)
    return feat

In [None]:
n_mfcc = 20
feat_names = [f"mfcc_mean_{i}" for i in range(n_mfcc)] + [f"mfcc_std_{i}" for i in range(n_mfcc)]

In [None]:
# =========================
# CONFIG
# =========================
CACHE_DIR = Path("/content/drive/MyDrive/Lab1_cache")
CACHE_DIR.mkdir(parents=True, exist_ok=True)

H5_PATH = CACHE_DIR / "train_mfcc_stats.h5"
KEY = "train"
BATCH = 256  # try 256 or 512?

# avoids the "string len limit" error
MIN_ITEMSIZE = {"filename": 120}

# =========================
# (OPTIONAL) CLEAN UP THE H5 IF IT WAS "BADLY CREATED"
# =========================
# If the itemsize error shows up, the cleanest fix is to delete and regenerate
if H5_PATH.exists():
    print("H5 already exists:", H5_PATH)
    print("If this file was created earlier WITHOUT min_itemsize, it could fail. "
          "If it fails again, delete it (os.remove) and retry.")
    # Uncomment to force re-creation
    # os.remove(H5_PATH)

# =========================
# RECOVER ALREADY-PROCESSED FILES
# =========================
processed = set()
if H5_PATH.exists():
    try:
        with pd.HDFStore(H5_PATH, mode="r") as store:
            if f"/{KEY}" in store.keys():
                processed = set(store.select(KEY, columns=["filename"])["filename"].astype(str).tolist())
    except Exception as e:
        print("Could not read the H5 to resume. Error:", repr(e))
        print("Recommendation: delete the H5 and regenerate.")
        # os.remove(H5_PATH)
        processed = set()

print("Previously processed:", len(processed))

# =========================
# EXTRACTION + SAVING PER BATCH
# =========================
rows = []
skipped_missing = 0
skipped_errors = 0
written = 0

for i in tqdm(range(len(df)), desc="Extracting MFCC"):
    fn = str(df.loc[i, "filename"])

    if fn in processed:
        continue

    wav_path = TRAIN_DIR / fn
    if not wav_path.exists():
        skipped_missing += 1
        continue

    try:
        feat = extract_mfcc_stats(str(wav_path), target_sr=22050, n_mfcc=n_mfcc)
    except Exception:
        skipped_errors += 1
        continue

    row = {"filename": fn}
    row.update({k: float(v) for k, v in zip(feat_names, feat)})

    for c in label_cols:
        row[c] = int(df.loc[i, c])

    rows.append(row)

    # flush per batch
    if len(rows) >= BATCH:
        out = pd.DataFrame(rows)

        # Save (with min_itemsize for filename)
        out.to_hdf(
            H5_PATH,
            key=KEY,
            mode="a",
            format="table",
            append=True,
            data_columns=["filename"],
            min_itemsize=MIN_ITEMSIZE,
            complib="blosc",
            complevel=5
        )

        written += len(out)
        rows = []

# final flush
if rows:
    out = pd.DataFrame(rows)
    out.to_hdf(
        H5_PATH,
        key=KEY,
        mode="a",
        format="table",
        append=True,
        data_columns=["filename"],
        min_itemsize=MIN_ITEMSIZE,
        complib="blosc",
        complevel=5
    )
    written += len(out)

print("Done. Saved to:", H5_PATH)
print("Newly written:", written)
print("Skipped (missing files):", skipped_missing)
print("Skipped (read/audio errors):", skipped_errors)

In [None]:
"""
# NOTE: only run this if you need to re-extract and save the features
CACHE_DIR = Path("/content/drive/MyDrive/Lab1_cache")
H5_PATH = CACHE_DIR / "train_mfcc_stats.h5"

if H5_PATH.exists():
    os.remove(H5_PATH)
    print("Deleted:", H5_PATH)
else:
    print("Does not exist yet:", H5_PATH)
"""

In [None]:
SAMPLE_N = 300
sample_files = df["filename"].sample(SAMPLE_N, random_state=42).tolist()
wav_paths = [str(TRAIN_DIR / f) for f in sample_files]

In [None]:
H5_PATH = "/content/drive/MyDrive/Lab1_cache/train_mfcc_stats.h5"
df_feat = pd.read_hdf(H5_PATH, key="train")

print("Shape:", df_feat.shape)
print("Cols:", df_feat.columns[:10].tolist())
print("Unique filenames:", df_feat["filename"].nunique())
print("Total nulls:", df_feat.isna().sum().sum())

df_feat.head()

### For Model 2: Log-Mel + CNN

In [16]:
# create a folder to store the features
CACHE_DIR = Path("/content/drive/MyDrive/Lab1_cache")
CACHE_DIR.mkdir(parents=True, exist_ok=True)
print(CACHE_DIR)

/content/drive/MyDrive/Lab1_cache


In [26]:
# =========================
# MODEL 2: LOG-MEL + CNN
# Feature Extraction (on-the-fly)
# =========================

import torch
import torchaudio
import librosa  # used below for resampling
import numpy as np
from pathlib import Path
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
import soundfile as sf

# Make sure label_cols exists (official column order)
if "label_cols" not in globals():
    label_cols = df.columns[1:].tolist()

N_CLASSES = len(label_cols)
print("Num classes:", N_CLASSES)

# --- Audio config ---
SR = 22050
DUR_SEC = 3.0
N_SAMPLES = int(SR * DUR_SEC)  # 66150 samples for 3 seconds
N_MELS = 128

# Log-Mel in torchaudio (fast)
mel_tf = torchaudio.transforms.MelSpectrogram(
    sample_rate=SR,
    n_fft=1024,
    hop_length=256,
    n_mels=N_MELS,
    power=2.0
)
amp_to_db = torchaudio.transforms.AmplitudeToDB(stype="power")

def load_audio_fixed(path, sr=SR, n_samples=N_SAMPLES):
    """
    Robust reader that does not need torchcodec:
    - Uses soundfile (libsndfile) to read the wav
    - Converts to mono
    - Resamples to SR if needed
    - Pads/trims to 3 seconds
    Returns a torch.Tensor [1, n_samples]
    """
    y, file_sr = sf.read(str(path), always_2d=True)   # shape: [T, C]
    y = y.mean(axis=1)                                # mono: [T]
    y = y.astype(np.float32)

    # Resample if needed
    if file_sr != sr:
        y = librosa.resample(y, orig_sr=file_sr, target_sr=sr)

    # Pad/trim to a fixed length
    if len(y) >= n_samples:
        y = y[:n_samples]
    else:
        y = np.pad(y, (0, n_samples - len(y)))

    # to tensor [1, n_samples]
    wav = torch.from_numpy(y).unsqueeze(0)
    return wav

def wav_to_logmel(wav: torch.Tensor):
    """
    wav: [1, n_samples] -> log-mel: [1, n_mels, frames]
    Normalized per file to stabilize training.
    """
    S = mel_tf(wav)          # [1, n_mels, frames]
    S = amp_to_db(S)         # log scale

    # per-sample normalization (simple and effective for a baseline)
    S = (S - S.mean()) / (S.std() + 1e-6)
    return S

class LogMelDataset(Dataset):
    def __init__(self, df_split, audio_dir: Path, label_cols, training=True):
        self.df = df_split.reset_index(drop=True)
        self.audio_dir = audio_dir
        self.label_cols = label_cols
        self.training = training

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        fn = self.df.loc[idx, "filename"]
        path = self.audio_dir / fn

        wav = load_audio_fixed(path)
        x = wav_to_logmel(wav)              # [1, 128, frames]

        y = self.df.loc[idx, self.label_cols].values.astype(np.float32)
        y = torch.from_numpy(y)             # [42]

        return x, y

print("Ready: functions + Dataset for log-mel.")

Num classes: 42
Ready: functions + Dataset for log-mel.
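
The `frames` dimension in the shapes above is left implicit, so here is a short worked calculation. It is a sketch that assumes the torchaudio defaults for `MelSpectrogram` that the cell does not override (a centered STFT, `center=True`) and relies on `n_fft` being even, in which case the frame count is `1 + n_samples // hop_length`. The helper names are hypothetical. The second helper applies floor division twice to mimic two 2x2 max-pooling layers, as in the CNN used for Model 2.

```python
def stft_num_frames(n_samples: int, hop_length: int) -> int:
    """Frame count of a centered STFT (assumes center=True and an even n_fft)."""
    return 1 + n_samples // hop_length


def after_pool(size: int, n_pools: int, kernel: int = 2) -> int:
    """Size left after n_pools max-pool layers with floor rounding."""
    for _ in range(n_pools):
        size //= kernel
    return size


# Values copied from the audio config cell
SR = 22050
DUR_SEC = 3.0
HOP_LENGTH = 256
N_MELS = 128

n_samples = int(SR * DUR_SEC)                    # 66150
frames = stft_num_frames(n_samples, HOP_LENGTH)  # 1 + 66150 // 256
input_shape = (1, N_MELS, frames)                # one log-mel example
pooled_shape = (after_pool(N_MELS, 2), after_pool(frames, 2))
```

Under these assumptions each example is a `[1, 128, 259]` tensor, and the feature map after two pooling stages is `32 x 64` before global average pooling.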


## Models: Training

### 1) MFCC(mean/std) + MLP



In [None]:
!ls -lah /content/drive/MyDrive
!ls -lah /content/drive/MyDrive/Lab1_cache

In [None]:
cands = list(Path("/content/drive").rglob("train_mfcc_stats.h5"))
print("Found:", len(cands))
for p in cands[:20]:
    print(p)

In [None]:
CACHE_DIR = Path("/content/drive/MyDrive/Lab1_cache")
H5_PATH = CACHE_DIR / "train_mfcc_stats.h5"

assert H5_PATH.exists(), f"Could not find: {H5_PATH}"

# list the keys stored in the HDF5 file (so the read below does not fail on the key)
with pd.HDFStore(H5_PATH, mode="r") as store:
    print("Keys:", store.keys())

df_feat = pd.read_hdf(H5_PATH, key="train")
print(df_feat.shape)
df_feat.head()

In [None]:
CACHE_DIR = Path("/content/drive/MyDrive/Lab1_cache")
H5_PATH   = CACHE_DIR / "train_mfcc_stats.h5"
KEY       = "train"

df_feat = pd.read_hdf(H5_PATH, key=KEY)
print("Shape features:", df_feat.shape)
print(df_feat.head(2))

In [None]:
label_cols = df.columns[1:].tolist()  # official Kaggle column order

# features (X): MFCC stats only
feat_names = [c for c in df_feat.columns if c.startswith("mfcc_mean_") or c.startswith("mfcc_std_")]

# sanity checks
missing = [c for c in label_cols if c not in df_feat.columns]
if missing:
    raise ValueError(f"These label columns are missing from df_feat: {missing[:10]} ... (total {len(missing)})")

print("Num feats:", len(feat_names))
print("Num labels:", len(label_cols))

In [None]:
X = df_feat[feat_names].values.astype(np.float32)
y = df_feat[label_cols].values.astype(np.float32)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

train_dl = DataLoader(TensorDataset(torch.tensor(X_train), torch.tensor(y_train)),
                      batch_size=512, shuffle=True)
val_dl   = DataLoader(TensorDataset(torch.tensor(X_val), torch.tensor(y_val)),
                      batch_size=1024, shuffle=False)

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)

class MLP(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, out_dim),
        )
    def forward(self, x):
        return self.net(x)

model = MLP(X.shape[1], y.shape[1]).to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# ===== pos_weight =====
pos = y_train.sum(axis=0)
neg = y_train.shape[0] - pos
pos_weight = neg / (pos + 1e-6)

# For classes with 0 positives, pos_weight becomes huge -> clamp
pos_weight = np.clip(pos_weight, 1.0, 20.0)

crit = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(pos_weight, device=device))

def eval_model(th=0.5):
    model.eval()
    all_pred, all_true = [], []
    with torch.no_grad():
        for xb, yb in val_dl:
            xb, yb = xb.to(device), yb.to(device)
            prob = torch.sigmoid(model(xb)).cpu().numpy()
            all_pred.append(prob)
            all_true.append(yb.cpu().numpy())

    P = np.vstack(all_pred)
    T = np.vstack(all_true)
    Yhat = (P >= th).astype(int)

    micro = f1_score(T, Yhat, average="micro", zero_division=0)
    macro_all = f1_score(T, Yhat, average="macro", zero_division=0)

    # More interpretable macro: excludes classes with no positives in validation
    mask = (T.sum(axis=0) > 0)
    macro_nonzero = f1_score(T[:, mask], Yhat[:, mask], average="macro", zero_division=0) if mask.any() else 0.0

    return micro, macro_all, macro_nonzero

for epoch in range(1, 6):
    model.train()
    total_loss = 0.0
    for xb, yb in train_dl:
        xb, yb = xb.to(device), yb.to(device)
        opt.zero_grad()
        loss = crit(model(xb), yb)
        loss.backward()
        opt.step()
        total_loss += loss.item() * xb.size(0)

    micro, macro_all, macro_nonzero = eval_model(th=0.5)
    print(f"Epoch {epoch} | loss={total_loss/len(train_dl.dataset):.4f} | "
          f"F1 micro={micro:.4f} | F1 macro(all)={macro_all:.4f} | F1 macro(nonzero)={macro_nonzero:.4f}")
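
To make the `pos_weight` step in the cell above concrete, the sketch below repeats the same arithmetic (negatives divided by positives, then clipped to `[1, 20]`) on a made-up label table, in plain Python so each number is easy to follow. `clipped_pos_weight` and `toy` are hypothetical names used only here.

```python
def clipped_pos_weight(label_rows, low=1.0, high=20.0, eps=1e-6):
    """Per-class neg/pos ratio, clipped to [low, high] as in the training cell."""
    n_rows = len(label_rows)
    n_classes = len(label_rows[0])
    weights = []
    for c in range(n_classes):
        pos = sum(row[c] for row in label_rows)
        neg = n_rows - pos
        raw = neg / (pos + eps)          # eps avoids division by zero
        weights.append(min(max(raw, low), high))
    return weights


# 4 clips, 3 hypothetical classes: one rare, one never present, one frequent
toy = [
    [1, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 0],
]
toy_weights = clipped_pos_weight(toy)
```

The rare class gets a weight close to 3, the class with no positives hits the upper clip of 20, and the frequent class is raised to the lower clip of 1, so positives are never down-weighted.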

In [None]:
def get_probs(model, dl, device):
    model.eval()
    probs_list, true_list = [], []
    with torch.no_grad():
        for xb, yb in dl:
            xb = xb.to(device)
            prob = torch.sigmoid(model(xb)).cpu().numpy()
            probs_list.append(prob)
            true_list.append(yb.numpy())
    return np.vstack(probs_list), np.vstack(true_list)

In [None]:
def tune_thresholds_macro_f1(P, T, grid=None):
    """
    P: probabilities (N, C)
    T: binary ground truth (N, C)
    Returns thresholds (C,) chosen to maximize each class's F1 independently,
    which in turn maximizes macro F1 on the data used for tuning.
    """
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)

    C = T.shape[1]
    thr = np.full(C, 0.5, dtype=np.float32)

    for c in range(C):
        # If validation has no positives for this class, tuning is meaningless.
        # Use a threshold above 1.0 so the class is never predicted: with `>=`,
        # a threshold of exactly 1.0 would still fire when sigmoid saturates to 1.0.
        if T[:, c].sum() == 0:
            thr[c] = 1.01
            continue

        best_f1, best_t = -1.0, 0.5
        for t in grid:
            pred = (P[:, c] >= t).astype(int)
            f1 = f1_score(T[:, c], pred, zero_division=0)
            if f1 > best_f1:
                best_f1, best_t = f1, t
        thr[c] = best_t

    return thr
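
The per-class search above can be traced by hand on a single class. The sketch below is a dependency-free illustration with made-up probabilities and labels. `f1_binary` and `best_threshold` are hypothetical helpers that mirror, but do not replace, the `f1_score` based loop in the notebook.

```python
def f1_binary(y_true, y_pred):
    """F1 for one binary class; returns 0.0 when there is nothing to score."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs, y_true, grid):
    """Return the first grid value with the highest F1, and that F1."""
    best_f1, best_t = -1.0, grid[0]
    for t in grid:
        pred = [1 if p >= t else 0 for p in probs]
        f1 = f1_binary(y_true, pred)
        if f1 > best_f1:
            best_f1, best_t = f1, t
    return best_t, best_f1


# One hypothetical class whose positives receive fairly low scores
toy_probs = [0.10, 0.20, 0.35, 0.40, 0.80]
toy_true = [0, 0, 1, 0, 1]
toy_grid = [0.1, 0.3, 0.5, 0.7, 0.9]

toy_t, toy_f1 = best_threshold(toy_probs, toy_true, toy_grid)
f1_at_half = f1_binary(toy_true, [1 if p >= 0.5 else 0 for p in toy_probs])
```

Here the default threshold of 0.5 misses the positive scored at 0.35, while 0.3 recovers it at the cost of one false positive, which raises F1 from about 0.67 to 0.80. Gains measured this way are optimistic when the thresholds are tuned and evaluated on the same split.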

In [None]:
# 1) validation probabilities and ground truth
P_val, T_val = get_probs(model, val_dl, device)

In [None]:
# 2) tune the thresholds
thresholds = tune_thresholds_macro_f1(P_val, T_val)

In [None]:
# 3) evaluate macro F1 with the tuned thresholds
# Note: the thresholds were tuned on this same validation set, so these scores are optimistic.
Yhat = (P_val >= thresholds[None, :]).astype(int)
macro_all = f1_score(T_val, Yhat, average="macro", zero_division=0)

mask = (T_val.sum(axis=0) > 0)
macro_nonzero = f1_score(T_val[:, mask], Yhat[:, mask], average="macro", zero_division=0) if mask.any() else 0.0

print("Macro F1 (all):", macro_all)
print("Macro F1 (nonzero):", macro_nonzero)
print("Example thresholds (first 10):", thresholds[:10])

### 2) Log-Mel + CNN

In [None]:
!ls -lah /content/drive/MyDrive
!ls -lah /content/drive/MyDrive/Lab1_cache

In [29]:
# =========================
# MODEL 2: LOG-MEL + CNN
# Training
# =========================

import torch.nn as nn
from sklearn.metrics import f1_score

# Split train/val
df_train, df_val = train_test_split(df, test_size=0.2, random_state=42)

train_ds = LogMelDataset(df_train, TRAIN_DIR, label_cols, training=True)
val_ds   = LogMelDataset(df_val,   TRAIN_DIR, label_cols, training=False)

# If GPU memory is running out or you are on CPU, lower batch_size to 16 or 8
BATCH_TRAIN = 32
BATCH_VAL   = 64

# num_workers: 2 usually works in Colab; if it raises an error, set it to 0
train_dl = DataLoader(train_ds, batch_size=BATCH_TRAIN, shuffle=True, num_workers=2, pin_memory=True)
val_dl   = DataLoader(val_ds,   batch_size=BATCH_VAL,   shuffle=False, num_workers=2, pin_memory=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)

class SmallAudioCNN(nn.Module):
    """
    Small CNN for log-mel input [B, 1, 128, T]
    - Learns time-frequency patterns better than MFCC(mean/std)+MLP
    - Uses global average pooling (GAP) so it does not depend on the exact T
    """
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d((2, 2)),  # 128->64

            nn.Conv2d(16, 32, 3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d((2, 2)),  # 64->32

            nn.Conv2d(32, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),

            nn.Conv2d(64, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
        )
        self.gap = nn.AdaptiveAvgPool2d((1, 1))
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):
        x = self.features(x)
        x = self.gap(x).squeeze(-1).squeeze(-1)  # [B,64]
        return self.head(x)                      # logits [B,42]

model = SmallAudioCNN(N_CLASSES).to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# pos_weight for class imbalance (same idea as in the MLP)
Y_train = df_train[label_cols].values.astype(np.float32)
pos = Y_train.sum(axis=0)
neg = Y_train.shape[0] - pos
pos_weight = neg / (pos + 1e-6)
pos_weight = np.clip(pos_weight, 1.0, 20.0)

crit = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(pos_weight, device=device))

def eval_macro_nonzero(model, dl, thr=0.30):
    model.eval()
    P_list, T_list = [], []
    with torch.no_grad():
        for xb, yb in dl:
            xb = xb.to(device)
            logits = model(xb)
            prob = torch.sigmoid(logits).cpu().numpy()
            P_list.append(prob)
            T_list.append(yb.numpy())

    P = np.vstack(P_list)
    T = np.vstack(T_list)
    Yhat = (P >= thr).astype(int)

    mask = (T.sum(axis=0) > 0)
    macro_nonzero = f1_score(T[:, mask], Yhat[:, mask], average="macro", zero_division=0) if mask.any() else 0.0
    return macro_nonzero

# Quick training run (baseline)
EPOCHS = 10
PATIENCE = 3
THR_VAL = 0.30  # global threshold for quick evaluation

best = -1.0
bad = 0
best_state = None

for epoch in range(1, EPOCHS + 1):
    model.train()
    total_loss = 0.0

    for xb, yb in train_dl:
        xb, yb = xb.to(device), yb.to(device)
        opt.zero_grad()
        loss = crit(model(xb), yb)
        loss.backward()
        opt.step()
        total_loss += loss.item() * xb.size(0)

    val_f1 = eval_macro_nonzero(model, val_dl, thr=THR_VAL)
    print(f"Epoch {epoch:02d} | loss={total_loss/len(train_dl.dataset):.4f} | val macro_nonzero@{THR_VAL}={val_f1:.4f}")

    if val_f1 > best + 1e-4:
        best = val_f1
        best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
        bad = 0
    else:
        bad += 1
        if bad >= PATIENCE:
            print("Early stopping.")
            break

# Restore the best model
model.load_state_dict({k: v.to(device) for k, v in best_state.items()})
print("Best val macro_nonzero:", best)

Device: cuda
Epoch 01 | loss=0.3602 | val macro_nonzero@0.3=0.2966
Epoch 02 | loss=0.2496 | val macro_nonzero@0.3=0.3438
Epoch 03 | loss=0.2151 | val macro_nonzero@0.3=0.3592
Epoch 04 | loss=0.1933 | val macro_nonzero@0.3=0.3785
Epoch 05 | loss=0.1763 | val macro_nonzero@0.3=0.4561
Epoch 06 | loss=0.1647 | val macro_nonzero@0.3=0.3708
Epoch 07 | loss=0.1555 | val macro_nonzero@0.3=0.4132
Epoch 08 | loss=0.1462 | val macro_nonzero@0.3=0.4782
Epoch 09 | loss=0.1396 | val macro_nonzero@0.3=0.4410
Epoch 10 | loss=0.1336 | val macro_nonzero@0.3=0.4913
Best val macro_nonzero: 0.4912854922999091
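
The `macro_nonzero` score printed above averages per-class F1 only over classes that have at least one positive example in the validation split. The following is a minimal, dependency-free sketch of that idea on a toy 3-class problem. The names `f1_binary`, `macro_f1_nonzero` and the toy matrices are hypothetical; they stand in for the `sklearn`-based `eval_macro_nonzero` and are not the notebook's implementation.

```python
def f1_binary(y_true, y_pred):
    """F1 for one class; returns 0.0 when there is nothing to score."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1_nonzero(T, P, thr=0.30):
    """Mean per-class F1 over classes with at least one positive in T."""
    scores = []
    for c in range(len(T[0])):
        true_col = [row[c] for row in T]
        if sum(true_col) == 0:
            continue  # class absent from this split: skip it
        pred_col = [1 if row[c] >= thr else 0 for row in P]
        scores.append(f1_binary(true_col, pred_col))
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (assumed): 4 clips, 3 classes; the third class never occurs in T.
T = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 0]]
P = [[0.9, 0.1, 0.4], [0.2, 0.8, 0.1], [0.1, 0.7, 0.5], [0.4, 0.2, 0.1]]

score = macro_f1_nonzero(T, P, thr=0.30)
print(score)  # 0.75: class 1 scores 0.5, class 2 scores 1.0
```

The third class never appears in `T`, so it is skipped; including it would add an F1 of 0 and pull the average down to 0.5.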


## Models: Testing and CSV Kaggle results

### 1) MFCC(mean/std) + MLP

In [None]:
# ===== ensure critical variables exist =====

# 1) TEST_DIR
if "TEST_DIR" not in globals():
    try:
        TEST_DIR = Path(TEST_OUT) / "test" if (Path(TEST_OUT) / "test").exists() else Path(TEST_OUT)
    except Exception:
        raise NameError("TEST_DIR is not defined. Run the extraction/path cells first.")

print("TEST_DIR =", TEST_DIR)

# 2) label_cols (official order for Kaggle)
if "label_cols" not in globals():
    if "df" in globals():
        label_cols = df.columns[1:].tolist()
    else:
        raise NameError("label_cols is not defined because df does not exist. Run the train.csv loading cell first.")

print("Num labels:", len(label_cols))

# 3) thresholds
if "thresholds" not in globals():
    raise NameError("thresholds is not defined. Run the threshold-tuning cell (validation) first.")

print("thresholds shape:", thresholds.shape)

# 4) model, feat_names
# Note: `model` must be the MFCC MLP here; the CNN training cell also assigns to `model`.
for v in ["model", "feat_names", "extract_mfcc_stats", "n_mfcc", "device"]:
    if v not in globals():
        raise NameError(f"Missing '{v}'. Run Feature Extraction and Model training first.")
print("OK: variables ready for submission.")

In [None]:
test_files = sorted([p.name for p in TEST_DIR.rglob("*.wav")])
print(test_files[:10])

In [None]:
print("TEST_DIR exists?", "TEST_DIR" in globals())
print("df exists?", "df" in globals())
print("label_cols:", len(label_cols) if "label_cols" in globals() else None)
print("feat_names:", len(feat_names) if "feat_names" in globals() else None)
print("thresholds:", thresholds.shape if "thresholds" in globals() else None)
print("model exists?", "model" in globals())

In [None]:
# --- What needs to exist in the notebook ---
# TEST_DIR, extract_mfcc_stats, n_mfcc, feat_names, model, device, thresholds, label_cols

# 0) Sanity checks
assert len(label_cols) == 42, f"label_cols should have 42 classes but has {len(label_cols)}"
assert thresholds.shape[0] == 42, f"thresholds should have shape (42,) but has {thresholds.shape}"

# 1) List test files
test_files = sorted([p.name for p in TEST_DIR.rglob("*.wav")])
print("Test audios:", len(test_files), "| example:", test_files[:3])

# 2) Extract MFCC stats for test
rows = []
for fn in tqdm(test_files, desc="Extracting MFCC (test)"):
    wav_path = TEST_DIR / fn
    feat = extract_mfcc_stats(str(wav_path), target_sr=22050, n_mfcc=n_mfcc)

    row = {"filename": fn}
    row.update({k: float(v) for k, v in zip(feat_names, feat)})
    rows.append(row)

df_test_feat = pd.DataFrame(rows)
print("df_test_feat:", df_test_feat.shape)

# 3) Inference on test
X_test = df_test_feat[feat_names].values.astype(np.float32)

model.eval()
with torch.no_grad():
    probs_test = torch.sigmoid(model(torch.tensor(X_test, device=device))).cpu().numpy()

assert probs_test.shape[1] == 42, f"The model should return 42 outputs but returns {probs_test.shape[1]}"

# 4) probs -> "1 5 7" (ALWAYS space-separated, NEVER commas)
def probs_to_pred_str(prob_row, thr):
    active = np.where(prob_row >= thr)[0] + 1  # +1 for 1..42
    active = active[(active >= 1) & (active <= 42)]
    if active.size == 0:
        return "0"
    return " ".join(map(str, active.tolist()))

pred_str = [probs_to_pred_str(probs_test[i], thresholds) for i in range(len(probs_test))]

# 5) Id = file name WITHOUT .wav (nothing else is trimmed)
ids = [Path(fn).stem for fn in df_test_feat["filename"].tolist()]

sub = pd.DataFrame({"Id": ids, "Predicted": pred_str})

# 6) IMPORTANT validations before saving
print("Columns:", sub.columns.tolist())
assert sub.shape[1] == 2, "The submission must have EXACTLY 2 columns: Id and Predicted"
assert sub["Id"].str.contains(r"\.wav$", regex=True).sum() == 0, "Id must not include .wav"

# Predicted: no commas and within 1..42
bad_commas = sub["Predicted"].str.contains(",", regex=False).sum()
print("Rows with a comma in Predicted:", bad_commas)
assert bad_commas == 0, "Predicted must NOT contain commas. Separate labels with spaces."

def max_index(s):
    if s == "0":
        return 0
    return max(map(int, s.split()))

mx = sub["Predicted"].map(max_index).max()
print("Max index in Predicted:", mx)
assert mx <= 42, f"There are indices > 42 in Predicted (max={mx}). That is wrong."

print(sub.head(5))

# 7) Save with a clear name for model 1
MODEL_TAG = "model1_mfcc_mlp"
sub_path = Path(f"/content/drive/MyDrive/Lab1_cache/submission_{MODEL_TAG}.csv")

# QUOTE_ALL helps keep Excel from splitting the Predicted field
sub.to_csv(sub_path, index=False, quoting=csv.QUOTE_ALL)

print("Saved:", sub_path)

# 8) (Optional) print the first RAW lines of the CSV to confirm the format
with open(sub_path, "r", encoding="utf-8") as f:
    for _ in range(5):
        print(f.readline().strip())

In [None]:
print(sub.head(5))
print("Columns:", sub.columns.tolist())

# check that Predicted has NO commas and that the max index is <= 42
bad_commas = sub["Predicted"].str.contains(",", regex=False).sum()
print("Rows with a comma:", bad_commas)

def max_idx(s):
    if s == "0":
        return 0
    return max(map(int, s.split()))

mx = sub["Predicted"].map(max_idx).max()
print("Max index:", mx)
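
To make the submission format concrete, here is a small standard-library sketch of the conversion from probabilities to the space-separated `Predicted` string, followed by a `QUOTE_ALL` CSV write. It assumes a toy setup of 4 classes instead of 42; the clip names, probabilities, and threshold values are invented for illustration, and the list-based `probs_to_pred_str` is a simplified stand-in for the NumPy version in the cell above.

```python
import csv
import io


def probs_to_pred_str(prob_row, thresholds):
    """Return 1-based indices of classes whose probability meets its threshold."""
    active = [i + 1 for i, (p, t) in enumerate(zip(prob_row, thresholds)) if p >= t]
    # "0" marks a clip with no predicted class, as in the cell above
    return " ".join(map(str, active)) if active else "0"


# Hypothetical per-class thresholds and test probabilities (4 classes)
thresholds = [0.5, 0.3, 0.4, 0.2]
probs = {
    "clip_a": [0.9, 0.1, 0.45, 0.25],
    "clip_b": [0.1, 0.2, 0.3, 0.1],
}

rows = [(clip_id, probs_to_pred_str(p, thresholds)) for clip_id, p in probs.items()]

# Write to an in-memory buffer; every field is quoted, labels stay space-separated
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerow(["Id", "Predicted"])
writer.writerows(rows)
csv_text = buf.getvalue()
print(csv_text)
```

With these values, `clip_a` becomes `"1 3 4"` and `clip_b` becomes `"0"`, and no field contains a comma.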


### 2) Log-Mel + CNN

In [30]:
# =========================
# MODEL 2: LOG-MEL + CNN
# Testing + Kaggle CSV
# =========================

import pandas as pd
import csv
from tqdm import tqdm

# Global threshold (make 2 quick submissions trying 0.30 and 0.25)
THR_TEST = 0.30

test_files = sorted([p.name for p in TEST_DIR.rglob("*.wav")])
print("Test audios:", len(test_files))

model.eval()
pred_str = []

with torch.no_grad():
    for fn in tqdm(test_files, desc="Test inference (CNN)"):
        wav = load_audio_fixed(TEST_DIR / fn)     # [1, 66150]
        x = wav_to_logmel(wav).unsqueeze(0)       # [1, 1, 128, frames]
        x = x.to(device)

        prob = torch.sigmoid(model(x)).cpu().numpy().reshape(-1)  # (42,)

        active = np.where(prob >= THR_TEST)[0] + 1  # 1..42

        # Anti-"0": if nothing exceeds THR, fall back to top-1 so macro recall is not lost
        if active.size == 0:
            s = str(int(np.argmax(prob) + 1))
        else:
            active_sorted = sorted(active.tolist(), key=lambda k: prob[k-1], reverse=True)
            s = " ".join(map(str, active_sorted))

        pred_str.append(s)

sub = pd.DataFrame({
    "Id": [Path(fn).stem for fn in test_files],
    "Predicted": pred_str
})

# Sanity checks
print("Rows with Predicted=='0':", (sub["Predicted"] == "0").sum())
bad_commas = sub["Predicted"].str.contains(",", regex=False).sum()
print("Rows with a comma:", bad_commas)

MODEL_TAG = f"model2_logmel_cnn_thr{THR_TEST}"
sub_path = Path(f"/content/drive/MyDrive/Lab1_cache/submission_{MODEL_TAG}.csv")
sub.to_csv(sub_path, index=False, quoting=csv.QUOTE_ALL)

print("Saved:", sub_path)
print(sub.head())

Test audios: 31187


Test inference (CNN): 100%|██████████| 31187/31187 [02:35<00:00, 200.52it/s]


Rows with Predicted=='0': 0
Rows with a comma: 0
Saved: /content/drive/MyDrive/Lab1_cache/submission_model2_logmel_cnn_thr0.3.csv
                             Id Predicted
0    INCT17_20191125_040000_0_3   27 8 10
1  INCT17_20191125_040000_10_13      27 8
2  INCT17_20191125_040000_11_14        27
3  INCT17_20191125_040000_12_15        27
4  INCT17_20191125_040000_13_16        27
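
The decoding rule used in the cell above (global threshold, labels ordered by descending probability, top-1 fallback when nothing passes) can be checked in isolation. The sketch below is a plain-Python re-statement under assumed toy probabilities for 4 classes; `decode_with_top1_fallback` is a hypothetical helper name, not a function defined elsewhere in the notebook.

```python
def decode_with_top1_fallback(prob, thr=0.30):
    """Turn one probability vector into a space-separated, 1-based label string."""
    active = [i + 1 for i, p in enumerate(prob) if p >= thr]
    if not active:
        # Nothing passed the threshold: emit the single most likely class
        best = max(range(len(prob)), key=lambda i: prob[i])
        return str(best + 1)
    # Most confident label first, as in the inference loop above
    active.sort(key=lambda k: prob[k - 1], reverse=True)
    return " ".join(map(str, active))


confident = decode_with_top1_fallback([0.10, 0.35, 0.80, 0.05])
uncertain = decode_with_top1_fallback([0.10, 0.25, 0.20, 0.05])
print(confident)  # 3 2
print(uncertain)  # 2
```

Because of the fallback, this rule never emits `"0"`, which matches the `Rows with Predicted=='0': 0` check in the output above.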
