# Step-by-Step: From MFCC `.npy` Files to a Model

## Step 1: Load Metadata

We’ll read `mfcc_metadata.csv` to get:
- The name of the `.npy` file with MFCCs,
- The `queen_status` label.

In [None]:
from pathlib import Path
import pandas as pd

# Load metadata file
metadata_df = pd.read_csv("mfcc_metadata.csv")

# Optional: Show distribution of labels
print(metadata_df["queen_status"].value_counts())

queen_status
3    3563
2    1553
0    1038
1     946
Name: count, dtype: int64


In [None]:
META_PATH = Path("mfcc_files_03-1") / "metadata.csv"
print("Path:", META_PATH.resolve())
print("Exists:", META_PATH.exists())
print("Size in bytes:", META_PATH.stat().st_size if META_PATH.exists() else -1)

# Show the first few lines (if there are any)
import itertools
if META_PATH.exists() and META_PATH.stat().st_size > 0:
    with open(META_PATH, "r", encoding="utf-8", errors="replace") as f:
        print("Preview:\n", "".join(list(itertools.islice(f, 5))))


Path: C:\Users\leona\Documents\Thesis_Project_UACH\Development(Code)\mfcc_files_03-1\metadata.csv
Exists: True
Size in bytes: 2
Preview:
 



In [None]:
mfcc_dir = Path("mfcc_files_03-1") / "mfcc_cmvn_noc0"   # or whichever subfolder you actually use
files = sorted(mfcc_dir.glob("*.npy"))
print("NPY files found:", len(files))


NPY files found: 7100


In [None]:
mfcc_dir = Path("mfcc_files_03-1") / "mfcc_cmvn_noc0"   # adjust if you use a different folder
files = sorted(mfcc_dir.glob("*.npy"))

df = pd.DataFrame({
    "mfcc_file": [f.name for f in files],
    "mfcc_path": [str(f) for f in files],
    # TODO: add your real labels here by merging with your label source
    # "queen_status": ...
})
df.to_csv(Path("mfcc_files_03-1") / "metadata.csv", index=False, encoding="utf-8")
print("metadata.csv rebuilt with", len(df), "rows")


metadata.csv rebuilt with 7100 rows


In [20]:
metadata_df = pd.read_csv("mfcc_files_03-1/metadata.csv")
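
The rebuilt `metadata.csv` only lists file names, so the labels still have to be joined back in before any training can happen. Below is a minimal sketch of that merge, assuming a separate label table keyed by the same `mfcc_file` name. The `files_df` and `labels_df` tables and the file names in them are hypothetical placeholders, not the project's real data.

```python
import pandas as pd

# Hypothetical stand-in for the rebuilt metadata (file names only)
files_df = pd.DataFrame({"mfcc_file": ["a_mfcc.npy", "b_mfcc.npy", "c_mfcc.npy"]})

# Hypothetical label source; in practice this would come from your own label file
labels_df = pd.DataFrame({
    "mfcc_file": ["a_mfcc.npy", "b_mfcc.npy"],
    "queen_status": [3, 0],
})

# Left join keeps every MFCC file; indicator=True adds a "_merge" column
merged = files_df.merge(labels_df, on="mfcc_file", how="left", indicator=True)

# Files with no matching label show up as "left_only"
unlabeled = merged.loc[merged["_merge"] == "left_only", "mfcc_file"].tolist()
print("Files without a label:", unlabeled)

# Keep only labeled rows for training and restore the integer label dtype
labeled_df = merged[merged["_merge"] == "both"].drop(columns="_merge")
labeled_df = labeled_df.astype({"queen_status": int})
```

Listing the unlabeled files first makes it easier to spot a naming mismatch between the two sources before rows are silently dropped.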

## Step 2: Prepare Data (Fixed Length for Simpler Models)

We’ll:
- Load the MFCC matrix from each `.npy` file,
- Pad or truncate it to a fixed number of frames (say, 100),
- Flatten it into a 1D vector (so you can use traditional models like Random Forest or Logistic Regression).
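
The pad, truncate, and flatten steps can be checked in isolation on synthetic arrays before touching the real files. This is a small sketch, assuming each MFCC matrix is shaped `(frames, n_coefficients)`. The helper name `to_fixed_vector` and the 12-coefficient test arrays are illustrative choices, not part of the original notebook.

```python
import numpy as np

def to_fixed_vector(mfcc, max_frames=100):
    """Zero-pad or truncate along the frame axis, then flatten to 1D.

    Assumes mfcc is shaped (frames, n_coefficients).
    """
    n_frames = mfcc.shape[0]
    if n_frames < max_frames:
        # Append rows of zeros after the last real frame
        mfcc = np.pad(mfcc, ((0, max_frames - n_frames), (0, 0)), mode="constant")
    else:
        # Keep only the first max_frames frames
        mfcc = mfcc[:max_frames]
    return mfcc.reshape(-1)

# Synthetic inputs: one clip shorter and one longer than max_frames
short = np.ones((60, 12))
long = np.ones((250, 12))

vec_short = to_fixed_vector(short)
vec_long = to_fixed_vector(long)
print(vec_short.shape, vec_long.shape)
```

Both vectors come out with `max_frames * n_coefficients` entries, which is what lets them be stacked into a single feature matrix.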

In [21]:
pwd

'c:\\Users\\leona\\Documents\\Thesis_Project_UACH\\Development(Code)'

In [23]:
# subfolder where 03-1 saved the MFCCs
MFCC_SUBDIR = "mfcc_cmvn_noc0"   # use "mfcc_raw" if you changed the flags

def pick_label_column(df):
    # Return the first known label column present in the metadata
    for c in ["queen_status", "label", "has_queen", "status"]:
        if c in df.columns:
            return c
    raise RuntimeError(
        "Your metadata has no label column ('queen_status' / 'label'). "
        "You need to add labels or merge with your labels file."
    )

def derive_mfcc_path(row):
    # Use the path directly if a column such as 'mfcc_path' already exists
    for c in ["mfcc_path", "mfcc_file", "mfcc"]:
        if c in row and isinstance(row[c], str) and row[c]:
            p = Path(row[c])
            if p.is_absolute() or str(p).startswith("mfcc_files_03-1"):
                return str(p)
            return str(Path("mfcc_files_03-1") / MFCC_SUBDIR / p.name)
    # If there is no such column, build the path from 'id'
    if "id" in row:
        return str(Path("mfcc_files_03-1") / MFCC_SUBDIR / f"{row['id']}_mfcc.npy")
    raise RuntimeError("Cannot work out the .npy file name ('mfcc_file' or 'id' is missing).")

# Apply the transformations
metadata_df["mfcc_path"] = metadata_df.apply(derive_mfcc_path, axis=1)
LABEL_COL = pick_label_column(metadata_df)

# Sanity check: files that exist
missing = [p for p in metadata_df["mfcc_path"] if not Path(p).exists()]
if missing:
    print("MFCC files that do not exist (showing 5):", missing[:5])
    raise FileNotFoundError("Some MFCC paths do not exist. Check MFCC_SUBDIR and the file names.")

RuntimeError: Your metadata has no label column ('queen_status' / 'label'). You need to add labels or merge with your labels file.

In [22]:
import numpy as np
import os

# Settings
mfcc_dir = "mfcc_files_03-1/"  # must be the folder that actually holds the .npy files (e.g. its mfcc_cmvn_noc0 subfolder)
n_mfcc = 12
max_frames = 100  # You can adjust this if needed

# Containers for features and labels
X = []
y = []

# Loop through metadata and load MFCCs
for idx, row in metadata_df.iterrows():
    file_path = os.path.join(mfcc_dir, row["mfcc_file"])
    label = row["queen_status"]

    # Load MFCC matrix
    mfcc = np.load(file_path)

    # Pad or truncate to fixed length
    if mfcc.shape[0] < max_frames:
        pad_width = max_frames - mfcc.shape[0]
        mfcc = np.pad(mfcc, ((0, pad_width), (0, 0)), mode="constant")
    else:
        mfcc = mfcc[:max_frames]

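    # Flatten to 1D vector: [max_frames × n_mfcc] → max_frames * n_mfcc features
    # (100 × 12 = 1200 with the settings above, assuming each file is shaped (frames, n_mfcc))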
    X.append(mfcc.flatten())
    y.append(label)

X = np.array(X)
y = np.array(y)

print("Loaded and prepared all data:", X.shape, y.shape)


KeyError: 'queen_status'

## Step 3: Train a Simple Model (Random Forest)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.35      0.49       208
           1       0.63      0.18      0.28       189
           2       0.64      0.46      0.53       310
           3       0.65      0.97      0.78       713

    accuracy                           0.66      1420
   macro avg       0.69      0.49      0.52      1420
weighted avg       0.67      0.66      0.62      1420
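
The two summary rows of the report can be reproduced by hand, which helps when reading them. A short sketch using only the rounded per-class F1 scores and supports printed above: the macro average weights every class equally, while the weighted average weights each class by its support, so it is dominated by class 3.

```python
# Per-class F1 and support copied from the report above
f1 = {0: 0.49, 1: 0.28, 2: 0.53, 3: 0.78}
support = {0: 208, 1: 189, 2: 310, 3: 713}

total = sum(support.values())

# Macro average: plain mean over classes
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: mean over classes, weighted by support
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(f"total support: {total}")
print(f"macro F1: {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
```

The gap between the two averages (0.52 versus 0.62) reflects how much weaker the model is on the three smaller classes than on class 3.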



## What We’ll Get:
- A first working classifier that predicts queen status from MFCC audio features.
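- Precision, recall, and F1-score per class, plus overall accuracy and macro/weighted averages.
- A real baseline you can compare future models to (like RNNs or CNNs): 0.66 accuracy and 0.52 macro F1 in the report above, with 0.97 recall on class 3 but only 0.18 to 0.46 recall on classes 0 to 2.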

## Next Levels
- Normalize MFCCs before modeling.
- Try PCA for feature reduction.
- Train time-distributed models (like LSTMs) using the full [frames × n_mfcc] matrices.
- Visualize samples: MFCC heatmaps vs. queen status.
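
As a starting point for the first two items, per-feature scaling and PCA can be chained in front of the same Random Forest. This is a rough sketch on synthetic data, not a tuned setup: the arrays `X_demo` and `y_demo` are random stand-ins for the flattened MFCC features and labels, and `n_components=50` is an arbitrary assumption that would need to be chosen from the real data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 1200))   # synthetic stand-in for flattened MFCCs
y_demo = rng.integers(0, 4, size=200)   # synthetic labels for 4 classes

# Scale each feature, reduce to 50 components, then classify
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=50, random_state=0),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
pipeline.fit(X_demo, y_demo)

# Inspect the reduced representation produced by the scaler + PCA steps
X_reduced = pipeline[:-1].transform(X_demo)
print(X_reduced.shape)
```

Keeping the scaler and PCA inside the pipeline means they are fitted only on whatever data is passed to `fit`, so calling it on `X_train` avoids leaking test-set statistics into the model.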