# 🎤 Domain-Specific Audio Classification

This notebook provides a starting point for working with audio in our domain.
It covers:
1. Dataset Loading
2. Dataset Exploration
3. Basic Feature Exploration
4. Preprocessing Techniques
5. 4Fluid Audio Features
6. 4Fluid Audio Classification

# 1. Dataset Loading

To load the dataset, we use the function `dvc_load_database_metadata` from the `data.dvc_loader` module.
This function fetches the metadata for the specified DVC database and dataset split.
Before running this step, **make sure to download the database metadata from DVC** (e.g., using `dvc pull`).

The audio files referenced in the dataset must be available in the folder `data/audios`; otherwise, they will be downloaded during the function call.

The function's parameters and the structure of the metadata it returns are described below:

## Parameters

- `database`: Specifies the database source (e.g., `DatabasesDVC.ONBOARDING`).
- `dataset`: Specifies the dataset type (e.g., `DatabaseSetsDVC.TRAIN` or `DatabaseSetsDVC.TEST`).

## Dataset Metadata Structure

The metadata is a table with the following columns:

- **`id`**: Unique identifier for each audio sample.
- **`class`**: The label or target class associated with the audio sample.
- **`url_audio`**: URL pointing to the corresponding audio file.
- **`path`**: Local path of the audio file (under `data/audios`).

In [1]:
from sklearn.metrics import confusion_matrix

from data.dvc_loader import dvc_load_database_metadata
from data.dvc_catalog import DatabasesDVC, DatabaseSetsDVC
from research.clustering_embeddings_soft_voting.dash_units import plot_mixture_model_dispersion

df_train = dvc_load_database_metadata(DatabasesDVC.ONBOARDING, DatabaseSetsDVC.TRAIN)
df_test = dvc_load_database_metadata(DatabasesDVC.ONBOARDING, DatabaseSetsDVC.TEST)

display(df_train.head())
display(df_test.head())



Unnamed: 0,id,class,url_audio,path
0,d6926f64-65e5-47c7-99c4-700d493d599a,LEAK,https://4fluid-samples-audio.s3.sa-east-1.amaz...,/home/stevan/Documentos/4fluid-ia/data/audios/...
1,f2c7ef7e-ddd9-469d-859a-031948b08d6b,LEAK,https://4fluid-samples-audios.s3.sa-east-1.ama...,/home/stevan/Documentos/4fluid-ia/data/audios/...
2,c5e299e9-914f-426b-b5c0-da25ce47d9cb,LEAK,https://4fluid-samples-audio.s3.sa-east-1.amaz...,/home/stevan/Documentos/4fluid-ia/data/audios/...
3,260b0c06-ab60-4657-bfc0-e6e6a066712a,LEAK,https://4fluid-samples-audios.s3.sa-east-1.ama...,/home/stevan/Documentos/4fluid-ia/data/audios/...
4,6731a2ff-5961-4f8c-8a25-e6d37aa2eef3,LEAK,https://4fluid-samples-audio.s3.sa-east-1.amaz...,/home/stevan/Documentos/4fluid-ia/data/audios/...


Unnamed: 0,id,class,url_audio,path
0,09753cbd-2201-4e21-bd83-eadba3e14130,LEAK,https://4fluid-samples-audio.s3.sa-east-1.amaz...,/home/stevan/Documentos/4fluid-ia/data/audios/...
1,244e2f24-33c2-47c4-829e-6b395e80041f,LEAK,https://4fluid-samples-audio.s3.sa-east-1.amaz...,/home/stevan/Documentos/4fluid-ia/data/audios/...
2,7373d90a-0d0d-4bce-abe0-fa083588b8c1,LEAK,https://4fluid-samples-audio.s3.sa-east-1.amaz...,/home/stevan/Documentos/4fluid-ia/data/audios/...
3,e942abf2-16e2-49b2-bd64-e1b3c51ebf1a,LEAK,https://4fluid-samples-audio.s3.sa-east-1.amaz...,/home/stevan/Documentos/4fluid-ia/data/audios/...
4,c6d62a6f-88bb-4915-9968-61491b2488ef,LEAK,https://4fluid-samples-audio.s3.sa-east-1.amaz...,/home/stevan/Documentos/4fluid-ia/data/audios/...


# 2. Dataset Exploration: Example Audio Samples

Now that we have the dataset metadata, we can explore some examples from each class.
The idea is to visualize the raw audio as **mel spectrograms**, which are widely used features in audio analysis and machine learning.

### What is a Mel Spectrogram?
- A **spectrogram** shows how the frequency content of a signal evolves over time.
- The **mel scale** is a perceptual scale that maps frequencies to how humans perceive pitch, compressing high frequencies and emphasizing low ones.
- By plotting mel spectrograms, we can see the differences between classes in a way that resembles how humans perceive sound.
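
As a rough illustration of how the mel scale compresses high frequencies, the sketch below uses the HTK-style formula `m = 2595 * log10(1 + f / 700)`. This formula is an assumption made for the example: `librosa` defaults to a different (Slaney) variant, so its values differ slightly. The helper names are hypothetical.

```python
import numpy as np

def hz_to_mel_htk(freq_hz):
    """Convert Hz to mels with the HTK-style formula (assumed convention)."""
    return 2595.0 * np.log10(1.0 + np.asarray(freq_hz, dtype=float) / 700.0)

def mel_to_hz_htk(mel):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

freqs = np.array([0.0, 1000.0, 2000.0, 4000.0, 8000.0])
mels = hz_to_mel_htk(freqs)

# The same 1 kHz step covers far fewer mels at the top of the range.
low_step = float(hz_to_mel_htk(1000.0) - hz_to_mel_htk(0.0))
high_step = float(hz_to_mel_htk(8000.0) - hz_to_mel_htk(7000.0))

print("mels:", np.round(mels, 1))
print("0-1 kHz step:", round(low_step, 1), "| 7-8 kHz step:", round(high_step, 1))
```

The first kilohertz spans roughly 1000 mels, while the step from 7 kHz to 8 kHz spans only a fraction of that, which is why mel spectrograms devote more resolution to low frequencies.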

### Procedure
1. We randomly select `n` samples from each class.
2. For each audio file:
   - Load the waveform from `data/audios/`.
   - Compute the **mel spectrogram** using `librosa`.
   - Convert the power values to decibels (dB) for better visualization.
3. Display the spectrograms in a grid:
   - **Columns** = classes.
   - **Rows** = different examples.

This helps us:
- Get an intuition of the dataset distribution.
- Verify if files are correctly loaded.
- Observe class-specific acoustic patterns (e.g., leakage frequencies vs. non-leakage, external noise events).

In [2]:
import librosa
import librosa.display  # needed for librosa.display.specshow below
import librosa.feature
import numpy as np
from matplotlib import pyplot as plt

n_samples = 5

examples = (
    df_train.groupby(by="class")[df_train.columns]
      .apply(lambda x: x.sample(n=min(n_samples, len(x)), random_state=42))
      .reset_index(drop=True)
)

examples

Unnamed: 0,id,class,url_audio,path
0,c849b755-b74e-4f97-a5be-cf9d52f3dea7,LEAK,https://4fluid-samples-audio.s3.sa-east-1.amaz...,/home/stevan/Documentos/4fluid-ia/data/audios/...
1,1fa664d3-6051-46b8-9274-3e13a1ca9609,LEAK,https://4fluid-samples-audio.s3.sa-east-1.amaz...,/home/stevan/Documentos/4fluid-ia/data/audios/...
2,ac9dc0c3-f342-4095-97a7-43e6f0317edb,LEAK,https://4fluid-samples-audio.s3.sa-east-1.amaz...,/home/stevan/Documentos/4fluid-ia/data/audios/...
3,5d9c4c7f-a834-4290-ae66-69beeb492d42,LEAK,https://4fluid-samples-audio.s3.sa-east-1.amaz...,/home/stevan/Documentos/4fluid-ia/data/audios/...
4,5ff39295-1e1d-4328-9f37-25f40fc97c96,LEAK,https://4fluid-samples-audio.s3.sa-east-1.amaz...,/home/stevan/Documentos/4fluid-ia/data/audios/...
5,1790d393-122c-49c6-8cf7-7c2f6bf1efdb,LESS,https://4fluid-samples-audio.s3.sa-east-1.amaz...,/home/stevan/Documentos/4fluid-ia/data/audios/...
6,1c8d6db2-ba02-4fe6-a058-1bb88c6bb3a5,LESS,https://4fluid-samples-audio.s3.sa-east-1.amaz...,/home/stevan/Documentos/4fluid-ia/data/audios/...
7,550e74a2-0a99-4caa-8014-ca4cd3fbd565,LESS,https://4fluid-samples-audio.s3.sa-east-1.amaz...,/home/stevan/Documentos/4fluid-ia/data/audios/...
8,e94a8a69-d8e0-4c64-a9d6-7f836c6c5ab9,LESS,https://4fluid-samples-audio.s3.sa-east-1.amaz...,/home/stevan/Documentos/4fluid-ia/data/audios/...
9,245aa098-5c19-4352-8053-834e5834bc4b,LESS,https://4fluid-samples-audio.s3.sa-east-1.amaz...,/home/stevan/Documentos/4fluid-ia/data/audios/...


In [3]:
filename = examples.iloc[0]["path"]
y, sr = librosa.load(filename, sr=None, mono=True)
duration = librosa.get_duration(y=y, sr=sr)

display(f"Sample rate {sr}Hz")
display(f"Duration {duration}s")

'Sample rate 16000Hz'

'Duration 10.0s'

In [4]:
# Plot mel spectrograms
fig, axes = plt.subplots(
    nrows=n_samples, ncols=examples["class"].nunique(),
    figsize=(15, 2.5 * n_samples),
    squeeze=False,  # always return a 2D array of axes, even for one row or column
)

for col_idx, (cls, group) in enumerate(examples.groupby("class")):
    for row_idx, (_, row) in enumerate(group.iterrows()):
        y, sr = librosa.load(row["path"], sr=None, mono=True)
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
        S_db = librosa.power_to_db(S, ref=np.max)

        ax = axes[row_idx, col_idx]
        img = librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="mel", ax=ax)
        ax.set_title(f"Class: {cls}, id: {row['id']}")
        fig.colorbar(img, ax=ax, format="%+2.0f dB")

plt.tight_layout()
plt.show()



# 3. Basic Feature Exploration

To understand our dataset better, we extract simple audio features from each sample.
We look at both **temporal** and **spectral** characteristics:

- **RMS Energy**: overall signal energy.
- **Zero Crossing Rate (ZCR)**: how often the signal changes sign, linked to noisiness.
- **Spectral Centroid**: "center of mass" of the spectrum, perceived brightness.
- **Spectral Bandwidth**: spread of the spectrum around the centroid.
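
To make these definitions concrete, here is a minimal NumPy sketch that computes the four quantities over a whole signal, without framing. It is not the `librosa` implementation, which works per frame, and all function names are hypothetical. A pure tone and white noise are used as synthetic stand-ins for real recordings.

```python
import numpy as np

def rms_energy(x):
    """Root-mean-square amplitude of the whole signal."""
    return float(np.sqrt(np.mean(x ** 2)))

def zero_crossing_rate(x):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(x)
    return float(np.mean(signs[1:] != signs[:-1]))

def centroid_and_bandwidth(x, sr):
    """Magnitude-weighted mean frequency and spread around it, in Hz."""
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    weights = mag / (np.sum(mag) + 1e-12)
    centroid = np.sum(freqs * weights)
    bandwidth = np.sqrt(np.sum(((freqs - centroid) ** 2) * weights))
    return float(centroid), float(bandwidth)

sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)             # narrow-band signal
noise = 0.5 * np.random.default_rng(0).standard_normal(sr)  # broadband signal

tone_centroid, tone_bw = centroid_and_bandwidth(tone, sr)
noise_centroid, noise_bw = centroid_and_bandwidth(noise, sr)

print("tone :", rms_energy(tone), zero_crossing_rate(tone), tone_centroid, tone_bw)
print("noise:", rms_energy(noise), zero_crossing_rate(noise), noise_centroid, noise_bw)
```

The tone has a low ZCR, a centroid at its own frequency, and almost no bandwidth. The noise has a ZCR near 0.5 and a centroid near the middle of the spectrum.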


In [5]:
import pandas as pd
import seaborn as sns

def extract_features(audio_path):
    y, sr = librosa.load(audio_path, sr=None, mono=True)

    rms = np.mean(librosa.feature.rms(y=y))
    zcr = np.mean(librosa.feature.zero_crossing_rate(y=y))
    centroid = np.mean(librosa.feature.spectral_centroid(y=y, sr=sr))
    bandwidth = np.mean(librosa.feature.spectral_bandwidth(y=y, sr=sr))

    return {
        "rms": rms,
        "zcr": zcr,
        "centroid": centroid,
        "bandwidth": bandwidth,
    }

# Extract features for all dataset entries
features = []
for _, row in df_train.iterrows():
    feats = extract_features(row["path"])
    feats["class"] = row["class"]
    feats["id"] = row["id"]
    features.append(feats)

features_df = pd.DataFrame(features)
features_df.head()

Unnamed: 0,rms,zcr,centroid,bandwidth,class,id
0,0.002495,0.129271,1677.504967,1775.653236,LEAK,d6926f64-65e5-47c7-99c4-700d493d599a
1,0.004748,0.166485,1753.877128,1530.121579,LEAK,f2c7ef7e-ddd9-469d-859a-031948b08d6b
2,0.029198,0.14547,1527.576173,1526.895509,LEAK,c5e299e9-914f-426b-b5c0-da25ce47d9cb
3,0.003144,0.084942,1234.877993,1535.706186,LEAK,260b0c06-ab60-4657-bfc0-e6e6a066712a
4,0.004198,0.156521,1706.933305,1581.159489,LEAK,6731a2ff-5961-4f8c-8a25-e6d37aa2eef3


By plotting the feature distributions across classes, we can check whether the features differ between classes and might therefore be useful for classification.

This helps identify whether combinations of features (e.g., spectral centroid vs. bandwidth) separate the classes better than a single feature alone.

We’ll use **Seaborn’s `pairplot`** to generate scatter plots for each feature pair, color-coded by class.

In [6]:
pairplot_df = features_df[["rms", "zcr", "centroid", "bandwidth", "class"]]

sns.pairplot(pairplot_df, hue="class", diag_kind="kde", corner=True, plot_kws={"alpha":0.6})
plt.suptitle("Pairwise Feature Distributions by Class", y=1.02)
plt.show()



# 4. Preprocessing Techniques

Before training models, raw audio typically goes through a series of **preprocessing steps**.
These steps ensure consistency across the dataset, reduce noise, and highlight features relevant to the task.
Below are common techniques:

---

### 1. Resampling
- **What**: Convert all audio to a fixed sample rate (e.g., 16 kHz).
- **Why**: Ensures uniformity; models cannot easily handle inputs with varying sampling rates.

---

### 2. Trimming / Silence Removal
- **What**: Remove leading/trailing silence.
- **Why**: Prevents silence from dominating features and wasting model capacity.
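
A minimal sketch of energy-based trimming, assuming non-overlapping frames and a threshold relative to the loudest frame. `librosa.effects.trim` offers a ready-made alternative. The function name and default values here are hypothetical.

```python
import numpy as np

def trim_silence(x, frame_length=400, top_db=30.0):
    """Drop leading/trailing frames more than `top_db` dB below the loudest frame.

    Samples after the last complete frame are ignored and therefore removed.
    """
    n_frames = len(x) // frame_length
    if n_frames == 0:
        return x
    frames = x[: n_frames * frame_length].reshape(n_frames, frame_length)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    ref = max(float(rms.max()), 1e-10)
    db = 20.0 * np.log10(np.maximum(rms, 1e-10) / ref)
    active = np.flatnonzero(db > -top_db)  # never empty: the loudest frame is at 0 dB
    start = active[0] * frame_length
    end = (active[-1] + 1) * frame_length
    return x[start:end]

sr = 16000
tone = 0.5 * np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
padded = np.concatenate([np.zeros(sr // 2), tone, np.zeros(sr // 2)])
trimmed = trim_silence(padded)

print(len(padded), len(trimmed))  # 32000 16000
```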

---

### 3. Normalization
- **What**: Scale audio amplitude (e.g., peak or RMS normalization).
- **Why**: Ensures that loudness does not bias the model.
- **Approach**: Divide waveform by max amplitude or normalize RMS energy.
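
Both approaches can be sketched in a few lines of NumPy. The function names and the target values are illustrative assumptions, not part of any library used in this notebook.

```python
import numpy as np

def peak_normalize(x, peak=1.0, eps=1e-12):
    """Scale so that the largest absolute sample equals `peak`."""
    return x * (peak / max(float(np.max(np.abs(x))), eps))

def rms_normalize(x, target_rms=0.1, eps=1e-12):
    """Scale so that the RMS of the signal equals `target_rms`."""
    rms = float(np.sqrt(np.mean(x ** 2)))
    return x * (target_rms / max(rms, eps))

rng = np.random.default_rng(0)
x = 0.02 * rng.standard_normal(16000)   # quiet synthetic recording

x_peak = peak_normalize(x)
x_rms = rms_normalize(x, target_rms=0.1)

print("peak after peak normalization:", np.max(np.abs(x_peak)))
print("RMS after RMS normalization:  ", np.sqrt(np.mean(x_rms ** 2)))
```

Peak normalization is sensitive to a single loud click, whereas RMS normalization reflects the overall level but can push isolated peaks above 1.0.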

---

### 4. Framing & Windowing
- **What**: Split the audio into overlapping short frames (e.g., 25 ms frames with a 10 ms hop).
- **Why**: Captures temporal dynamics; most features (spectrograms, MFCCs) are computed per frame.
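
The sketch below shows framing with index arithmetic, assuming 25 ms frames and a 10 ms hop at 16 kHz, followed by a Hann window. `frame_signal` is a hypothetical helper; `librosa.util.frame` provides similar functionality.

```python
import numpy as np

def frame_signal(x, frame_length, hop_length):
    """Split a 1D signal into overlapping frames (one per row); the incomplete tail is dropped."""
    if len(x) < frame_length:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(x) - frame_length) // hop_length
    starts = hop_length * np.arange(n_frames)
    idx = starts[:, None] + np.arange(frame_length)[None, :]
    return x[idx]

sr = 16000
frame_length = sr * 25 // 1000   # 25 ms -> 400 samples
hop_length = sr * 10 // 1000     # 10 ms -> 160 samples

x = np.arange(sr, dtype=float)   # 1 s ramp, so each value equals its sample index
frames = frame_signal(x, frame_length, hop_length)
windowed = frames * np.hanning(frame_length)  # taper each frame before spectral analysis

print(frames.shape)  # (98, 400)
```

Consecutive frames share 240 samples here (frame length minus hop), which is what lets frame-level features follow changes in the signal smoothly.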

---

### 5. Feature Extraction
Common feature representations:
- **Spectrogram / Mel Spectrogram**: Time-frequency representation, perceptually relevant.
- **MFCC (Mel-Frequency Cepstral Coefficients)**: Compressed mel representation, popular in speech/audio tasks.

---

### 6. Data Augmentation (optional)
- **What**: Apply transformations to increase variability.
- **Examples**:
  - Pitch shifting
  - Adding background noise
- **Why**: Improves model robustness to real-world variations.
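
As an example of the background-noise augmentation, the following sketch adds white Gaussian noise scaled to a chosen signal-to-noise ratio. White noise is an assumption made for simplicity; recorded field noise would be more realistic. The function name is hypothetical.

```python
import numpy as np

def add_noise_at_snr(x, snr_db, rng):
    """Return x plus white noise scaled so that the result has the requested SNR in dB."""
    signal_power = np.mean(x ** 2)
    noise = rng.standard_normal(len(x))
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = noise * np.sqrt(target_noise_power / np.mean(noise ** 2))
    return x + noise

sr = 16000
rng = np.random.default_rng(0)
clean = 0.1 * np.sin(2 * np.pi * 300.0 * np.arange(sr) / sr)
noisy = add_noise_at_snr(clean, snr_db=20.0, rng=rng)

# Verify the SNR that was actually achieved.
measured_snr_db = 10.0 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print("measured SNR (dB):", round(float(measured_snr_db), 3))
```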

---

### 7. Padding / Cropping
- **What**: Standardize clip lengths (fixed duration).
- **Why**: Models often expect consistent input size.
- **Approach**:
  - Pad with zeros if shorter.
  - Crop or segment if longer.
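
A minimal sketch of the pad-or-crop approach, assuming zero-padding at the end and cropping from the start of the clip. `librosa.util.fix_length` behaves similarly. `fix_length_1d` is a hypothetical name.

```python
import numpy as np

def fix_length_1d(x, target_len):
    """Zero-pad at the end or crop from the start so the output has exactly target_len samples."""
    if len(x) >= target_len:
        return x[:target_len]
    return np.pad(x, (0, target_len - len(x)))

sr = 16000
target_len = 10 * sr                      # standardize to 10 s clips

short_clip = np.ones(7 * sr)
long_clip = np.ones(12 * sr)

short_fixed = fix_length_1d(short_clip, target_len)
long_fixed = fix_length_1d(long_clip, target_len)

print(len(short_fixed), len(long_fixed))  # 160000 160000
```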

---

## Summary
By combining these preprocessing techniques, we ensure that:
- All samples have consistent format and quality.
- Irrelevant parts (silence, noise) are reduced.
- Relevant acoustic patterns are emphasized for classification.


## 4.1 Filtering

Below we apply two simple filters to the audio and compare the mel spectrograms before and after filtering:
- **High-pass filter**: Removes low-frequency noise such as hum (< 150 Hz).
- **Median filter**: Suppresses short bursts of noise. Here it is applied to the time-domain signal with a 5-sample kernel.
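
Before applying a filter to real recordings, it can help to check its behavior on synthetic tones. The sketch below assumes a 5th-order Butterworth high-pass at 150 Hz and measures its gain for a 50 Hz hum and a 1 kHz tone. It uses second-order sections (`output="sos"` with `sosfilt`), a numerically more robust form than the `b, a` coefficients with `lfilter`; this is a substitution made for the example.

```python
import numpy as np
from scipy.signal import butter, sosfilt

fs = 16000
t = np.arange(2 * fs) / fs
hum = np.sin(2 * np.pi * 50.0 * t)      # below the cutoff, should be attenuated
tone = np.sin(2 * np.pi * 1000.0 * t)   # above the cutoff, should pass through

# Passing fs lets us give the cutoff directly in Hz.
sos = butter(5, 150.0, btype="highpass", fs=fs, output="sos")

def steady_rms(x):
    """RMS after the first half second, skipping the filter's start-up transient."""
    return float(np.sqrt(np.mean(x[fs // 2:] ** 2)))

hum_gain_db = 20.0 * np.log10(steady_rms(sosfilt(sos, hum)) / steady_rms(hum))
tone_gain_db = 20.0 * np.log10(steady_rms(sosfilt(sos, tone)) / steady_rms(tone))

print("50 Hz gain (dB):  ", round(float(hum_gain_db), 1))
print("1000 Hz gain (dB):", round(float(tone_gain_db), 2))
```

The hum should come out strongly attenuated while the 1 kHz tone is left essentially untouched.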


In [7]:
from scipy.signal import butter, lfilter, medfilt

# Define high-pass filter
def butter_highpass(cutoff, fs, order=5):
    nyquist = 0.5 * fs
    high = cutoff / nyquist
    b, a = butter(order, high, btype="high")
    return b, a

def highpass_filter(data, cutoff, fs, order=5):
    b, a = butter_highpass(cutoff, fs, order=order)
    return lfilter(b, a, data)


# Plot original vs. filtered mel spectrograms for the LEAK examples
fig, axes = plt.subplots(
    nrows=n_samples, ncols=2,  # column 0: original, column 1: filtered
    figsize=(15, 2.5 * n_samples),
    squeeze=False,  # always return a 2D array of axes
)

examples_leak = examples[examples["class"] == "LEAK"]
for row_idx, (_, row) in enumerate(examples_leak.iterrows()):
    y, sr = librosa.load(row["path"], sr=None, mono=True)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    S_db = librosa.power_to_db(S, ref=np.max)

    x_hp = highpass_filter(y, cutoff=150, fs=sr)
    x_hp = medfilt(x_hp, kernel_size=5)
    S_hp = librosa.feature.melspectrogram(y=x_hp, sr=sr, n_mels=64)
    S_db_hp = librosa.power_to_db(S_hp, ref=np.max)

    ax = axes[row_idx, 0]
    img = librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="mel", ax=ax)
    ax.set_title(f"Original, id: {row['id']}")
    fig.colorbar(img, ax=ax, format="%+2.0f dB")

    ax = axes[row_idx, 1]
    img = librosa.display.specshow(S_db_hp, sr=sr, x_axis="time", y_axis="mel", ax=ax)
    ax.set_title(f"Filtered, id: {row['id']}")
    fig.colorbar(img, ax=ax, format="%+2.0f dB")

plt.tight_layout()
plt.show()



## 4.2 Framing and Feature Extraction

### Why Framing?
- Audio signals are **non-stationary**: their frequency content changes over time.
- To capture this, we divide the signal into **short overlapping frames** (e.g., 25 ms with 10 ms hop).
- On each frame we extract features, resulting in a time series of descriptors.

![](assets/frames.png)

### Common Frame-Based Features
- **Energy / RMS**: how strong the signal is in the frame.
- **Zero Crossing Rate (ZCR)**: noisiness of the frame.
- **Spectral Centroid**: brightness (where the spectrum "center of mass" lies).
- **MFCCs**: compact representation of the spectrum, widely used in speech/audio ML.

In [8]:
audio_filename = df_train.iloc[0]["path"]
x, fs = librosa.load(audio_filename, sr=None, mono=True)

# Parameters for framing
frame_length = int(0.25 * fs)  # 250 ms
hop_length = int(0.10 * fs)    # 100 ms

# Compute frame-level features
rms = librosa.feature.rms(y=x, frame_length=frame_length, hop_length=hop_length)[0]
zcr = librosa.feature.zero_crossing_rate(y=x, frame_length=frame_length, hop_length=hop_length)[0]
centroid = librosa.feature.spectral_centroid(y=x, sr=fs, n_fft=frame_length, hop_length=hop_length)[0]
mfcc = librosa.feature.mfcc(y=x, sr=fs, n_mfcc=13, n_fft=frame_length, hop_length=hop_length)

# Time axis for frames
times = librosa.frames_to_time(np.arange(len(rms)), sr=fs, hop_length=hop_length)

# Plot example: RMS, ZCR, Centroid
fig, axes = plt.subplots(3, 1, figsize=(12, 8), sharex=True)

axes[0].plot(times, rms, label="RMS Energy")
axes[0].set_ylabel("Energy")
axes[0].legend()

axes[1].plot(times, zcr, label="Zero Crossing Rate", color="orange")
axes[1].set_ylabel("ZCR")
axes[1].legend()

axes[2].plot(times, centroid, label="Spectral Centroid", color="green")
axes[2].set_ylabel("Hz")
axes[2].set_xlabel("Time (s)")
axes[2].legend()

plt.suptitle("Frame-Based Feature Extraction")
plt.tight_layout()
plt.show()

# Optional: plot MFCCs as heatmap
plt.figure(figsize=(10, 4))
librosa.display.specshow(mfcc, sr=fs, hop_length=hop_length, x_axis="time")
plt.colorbar(label="MFCC Coefficients")
plt.title("MFCCs per Frame")
plt.show()



# 5. 4Fluid Audio Features

We provide a set of utilities in `fluid.sdk_audio.preprocessing` to handle common audio preprocessing tasks.
Below are the basic methods used in the pipeline:

1. `AudioUtil.readingFile(path)`
- Loads the audio waveform and sampling rate.
- Returns: `(waveform, sampling_rate)`.

2. `CropUtil.crop(x, crop_start, crop_end, duration, fs)`
- Crops the audio signal to a fixed segment.
- Parameters:
  - `crop_start`: amount to crop from the start of the signal, in seconds.
  - `crop_end`: amount to crop from the end of the signal, in seconds.
  - `duration`: length of the segment to keep, in seconds. Currently has no effect on the output.
  - `fs`: sampling rate.
- Useful to standardize input length or focus on a specific segment.

3. `ResampleUtil.resample(x, fs, target_fs)`
- Resamples the waveform to a new sampling rate (`target_fs`).
- Ensures all audio samples have a consistent sampling rate for feature extraction.

4. Mean removal and noise injection
- `x = x - np.mean(x)`: removes the DC offset (centers the signal).
- `x = x + 1e-10 * np.random.rand(x.size)`: adds tiny noise to avoid numerical issues (e.g., zero-only signals).
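
The effect of these two lines can be checked on synthetic data. The sketch below uses a sine riding on a constant offset and an all-zero signal; both are made-up inputs for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A sine riding on a constant 0.3 offset: subtracting the mean centers it at zero.
x = 0.3 + 0.1 * np.sin(2 * np.pi * 5.0 * np.arange(1000) / 1000)
x_centered = x - np.mean(x)

# An all-zero signal has zero energy, so its log energy is -inf.
silent = np.zeros(1000)
with np.errstate(divide="ignore"):
    log_energy_silent = np.log10(np.sum(silent ** 2))

# Tiny noise keeps the energy strictly positive and the log finite.
dithered = silent + 1e-10 * rng.random(silent.size)
log_energy_dithered = np.log10(np.sum(dithered ** 2))

print("mean before/after:", np.mean(x), np.mean(x_centered))
print("log energy silent/dithered:", log_energy_silent, log_energy_dithered)
```

The injected noise is many orders of magnitude below typical signal levels, so it is inaudible while still protecting later steps that take logarithms or divide by energy.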

---

## Original vs. Processed Audio

Below, we visualize the effect of preprocessing by plotting the waveform **before and after**.

In [9]:
from fluid.sdk_audio.preprocessing.audio import AudioUtil
from fluid.sdk_audio.preprocessing.crop import CropUtil
from fluid.sdk_audio.preprocessing.resample import ResampleUtil

x_orig, fs_orig = AudioUtil.readingFile("assets/leakage_barks.wav")
x_proc = CropUtil.crop(x_orig, 1, 0, 10, fs_orig)

fs_proc = 8000
x_proc = ResampleUtil.resample(x_proc, fs_orig, fs_proc)
x_proc = x_proc - np.mean(x_proc)
x_proc = x_proc + 1e-10 * np.random.rand(x_proc.size)

fig, axes = plt.subplots(1, 2, figsize=(12, 6), sharex=False)

axes[0].plot(np.linspace(0, len(x_orig)/fs_orig, len(x_orig)), x_orig)
axes[0].set_title("Original Audio")
axes[0].set_xlabel("Time (s)")
axes[0].set_ylabel("Amplitude")

axes[1].plot(np.linspace(0, len(x_proc)/fs_proc, len(x_proc)), x_proc)
axes[1].set_title("Processed Audio")
axes[1].set_xlabel("Time (s)")
axes[1].set_ylabel("Amplitude")

plt.tight_layout()
plt.show()



After the initial audio preparation stage, the best segments are extracted by selecting the frames with the lowest energy and the highest similarity. Because leaks are stationary, selecting lower-energy frames tends to eliminate interference from external noise. The similarity criterion, in turn, favors frames with similar characteristics, even if they are not consecutive.
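
To illustrate only the energy criterion, here is a simplified NumPy sketch. It is not the SDK's `Prefilter` algorithm: the similarity criterion is omitted, and the frame length (100 ms) and kept fraction (0.375) are illustrative values whose correspondence to the SDK parameters is an assumption. `select_quiet_frames` is a hypothetical helper.

```python
import numpy as np

def select_quiet_frames(x, frame_length, keep_fraction):
    """Return the indices (in time order) of the lowest-energy non-overlapping frames."""
    n_frames = len(x) // frame_length
    frames = x[: n_frames * frame_length].reshape(n_frames, frame_length)
    energy = np.sum(frames ** 2, axis=1)
    n_keep = max(1, int(round(keep_fraction * n_frames)))
    keep = np.sort(np.argsort(energy)[:n_keep])
    return keep, frames

fs = 8000
rng = np.random.default_rng(0)

# Stationary low-level background (stand-in for a leak) with a loud
# one-second burst (stand-in for external noise) between 3 s and 4 s.
x = 0.01 * rng.standard_normal(10 * fs)
x[3 * fs : 4 * fs] += 0.5 * rng.standard_normal(fs)

keep, frames = select_quiet_frames(x, frame_length=fs // 10, keep_fraction=0.375)
x_selected = frames[keep].reshape(-1)   # concatenate the kept frames

burst_frames = set(range(30, 40))       # frames covering 3 s to 4 s
print("kept frames:", len(keep), "| overlap with burst:", burst_frames & set(keep.tolist()))
```

In this toy case none of the kept frames fall inside the burst, which is the behavior the energy criterion is meant to encourage.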

In [10]:
from fluid.sdk_audio.processing.prefilter import Prefilter, PrefilterConfig

filter_config = PrefilterConfig(prefilterDict={
    "similarityFactor" : 0.250000,
    "percOfMinimalFrames" : 0.375000,
    "windowSegmInMSec" : 100,
    "freqCut2D" : 0.500000,
    "powFacEmph" : 2.000000,
    "withOversmooth" : 0
})

outputs = Prefilter.filter(filter_config, x_proc, fs_proc, return_dict=True)
x_filtered = outputs["x_filtered"]

fig, axes = plt.subplots(1, 2, figsize=(12, 6), sharex=False)

axes[0].plot(np.linspace(0, len(x_proc)/fs_proc, len(x_proc)), x_proc)
axes[0].set_title("Processed Audio (Crop/ Mean Removal)")
axes[0].set_xlabel("Time (s)")
axes[0].set_ylabel("Amplitude")

axes[1].plot(np.linspace(0, len(x_filtered)/fs_proc, len(x_filtered)), x_filtered)
axes[1].set_title("Processed Audio (Filtered)")
axes[1].set_xlabel("Time (s)")
axes[1].set_ylabel("Amplitude")

plt.tight_layout()
plt.show()


fig, axes = plt.subplots(1, 2, figsize=(12, 6), sharex=False)

S_o = librosa.feature.melspectrogram(y=x_proc, sr=fs_proc, n_fft=1024, hop_length=256, n_mels=128)
S_o = librosa.power_to_db(S_o, ref=np.max)
img = librosa.display.specshow(S_o, sr=fs_proc, x_axis="time", y_axis="mel", ax=axes[0])
axes[0].set_title("Prefilter input")
fig.colorbar(img, ax=axes[0], format="%+2.0f dB")

S_f = librosa.feature.melspectrogram(y=x_filtered, sr=fs_proc, n_fft=1024, hop_length=256, n_mels=128)
S_f = librosa.power_to_db(S_f, ref=np.max)
img = librosa.display.specshow(S_f, sr=fs_proc, x_axis="time", y_axis="mel", ax=axes[1])
axes[1].set_title("Prefilter output")
fig.colorbar(img, ax=axes[1], format="%+2.0f dB")

plt.tight_layout()
plt.show()



In [11]:
M = outputs["prefilter_correlation"]
fig = plt.figure(figsize=(8,8))
M_abs = np.abs(M)  # plot the magnitude and scale the colors to the plotted values
plt.imshow(M_abs, vmin=M_abs.min(), vmax=M_abs.max(), cmap="gray")
plt.axis('off')
plt.show()




Now that we have both the processed audio and the filtered audio, the next step is feature extraction. The SDK uses a predefined setup in which each input feeds a dedicated group of feature extractors (a collection). The filtered audio is used to compute the frame-level features and the pitch. The processed (unfiltered) audio is used to compute the 'pipe' features from the entire spectrogram.

In [12]:
from fluid.sdk_audio.features.recipe_features import RecipeFeatures

feature_config = {
    "inputs": {
        "frames": "frames",
        "pipes": "frame_raw",
        "pda": "frame_filtered"
    },
    "outputs": {
        "frameFactor": 9
    },
    "collections": {
        "frames": {
            "audio": {
                "audio_fs": fs_proc,
                "audio_wms": 1000,
                "audio_frameFactor": 9,
                "audio_window": "hanning"
            },
            "features": {
                "FeatureLPC": {"coefficients": 16},
                "FeatureEnergy": {},
                "FeatureZCR": {},
                "FeatureSNR": {},
                "FeatureClipness": {},
                "FeatureCepstrum": {"coefficients": 16},
                "FeaturePLP": {"coefficients": 16},
                "FeatureMFCC": {"coefficients": 16},
                "FeatureBassRatio": {"fc": 1600},
                "FeatureBandWidth": {"pbw": 0.2},
                "FeatureRollOff": {"factor": 0.85},
                "FeatureSpectralStd": {},
                "FeatureSpectralSpread": {},
                "FeatureSpectralFlow": {},
                "FeatureSpectralIrregularity": {"pbw": 0.2},
                "FeatureSpectralFlatness": {},
                "FeatureSpectralCentroid": {},
                "FeaturePeakness": {"diffdb": 3, "fneigh": 6},
                "FeatureBandRatio": {"coefficients": 16},
                "FeaturePeaks": {"numpks": 4, "fneigh": 6}
            }
        },
        "pipes": {
            "audio": {
                "audio_fs": fs_proc
            },
            "features": {
                "FeaturePipe": {"wms": 100, "freqCut2D": 0.75, "openradius": 10, "rng": 0.33, "rpk": 5}
            }
        },
        "pda": {
            "audio": {
                "audio_fs": fs_proc
            },
            "features": {
                "FeaturePDA": {"w": fs_proc, "fmax": 4000, "fmin": 80, "wms": 1000, "frameFactor": 9, "fneigh": 6}
            }
        }
    }
}


r = RecipeFeatures(x_proc, x_filtered, feature_config, parallel=False)
m, keys = r.extract()

df_feature = pd.DataFrame(m, columns=keys)
df_feature

Unnamed: 0,lpc1,lpc2,lpc3,lpc4,lpc5,lpc6,lpc7,lpc8,lpc9,lpc10,...,Scwspec_bw,Gpeakness,Speakness,Gstdbandratio,Sstdbandratio,Gfd,Sfd,GPfd,Gfo,Sfo
0,-3.553907,5.164565,-3.939156,1.639013,-0.532347,0.821705,-0.908842,0.270421,-0.262626,1.006698,...,5.031697,526.111111,25.002222,0.159244,0.002296,142.833333,79.842032,130.074857,368.438596,70.899329
1,-3.578645,5.281738,-4.17389,1.870772,-0.609614,0.741797,-0.776836,0.192061,-0.292193,1.130231,...,5.031697,526.111111,25.002222,0.159244,0.002296,142.833333,79.842032,130.074857,368.438596,70.899329
2,-3.520043,5.11049,-4.031212,1.901794,-0.767496,0.944948,-0.920349,0.214973,-0.264547,1.113155,...,5.031697,526.111111,25.002222,0.159244,0.002296,142.833333,79.842032,130.074857,368.438596,70.899329
3,-3.553417,5.252416,-4.232714,2.005471,-0.769134,0.927575,-0.841128,-0.056479,0.213285,0.625925,...,5.031697,526.111111,25.002222,0.159244,0.002296,142.833333,79.842032,130.074857,368.438596,70.899329
4,-3.624443,5.47655,-4.462384,2.035755,-0.585653,0.610662,-0.571568,-0.217819,0.431996,0.303681,...,5.031697,526.111111,25.002222,0.159244,0.002296,142.833333,79.842032,130.074857,368.438596,70.899329
5,-3.626502,5.475476,-4.485389,2.180558,-0.899338,0.986148,-0.951353,0.300567,-0.264862,1.020705,...,5.031697,526.111111,25.002222,0.159244,0.002296,142.833333,79.842032,130.074857,368.438596,70.899329
6,-3.641787,5.556204,-4.547866,2.004908,-0.532704,0.650427,-0.664866,0.010727,0.01521,0.79702,...,5.031697,526.111111,25.002222,0.159244,0.002296,142.833333,79.842032,130.074857,368.438596,70.899329
7,-3.678492,5.703144,-4.77361,2.14429,-0.489167,0.504839,-0.5527,-0.003911,-0.06556,0.907923,...,5.031697,526.111111,25.002222,0.159244,0.002296,142.833333,79.842032,130.074857,368.438596,70.899329
8,-3.69123,5.735599,-4.812457,2.186168,-0.559728,0.665217,-0.794455,0.21622,-0.239469,1.082648,...,5.031697,526.111111,25.002222,0.159244,0.002296,142.833333,79.842032,130.074857,368.438596,70.899329


# 6. 4Fluid Audio Classification

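Now, let's build some helper functions that iterate over a dataset and extract its features. To save time, the features are cached in a temporary folder and loaded from there if they already exist. Note that the `extract_features` function defined below replaces the function of the same name from Section 3.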

In [13]:
import os


def extract_features(filename: str, pre_filter_config: dict, recipe_feature_config: dict) -> pd.DataFrame:
    x_orig, fs_orig = AudioUtil.readingFile(filename)
    x_proc = CropUtil.crop(x_orig, 1, 0, 10, fs_orig)

    fs_proc = 8000
    x_proc = ResampleUtil.resample(x_proc, fs_orig, fs_proc)
    x_proc = x_proc - np.mean(x_proc)
    x_proc = x_proc + 1e-10 * np.random.rand(x_proc.size)

    x_filtered = Prefilter.filter(pre_filter_config, x_proc, fs_proc, return_dict=False)

    r = RecipeFeatures(x_proc, x_filtered, recipe_feature_config, parallel=False)
    m, keys = r.extract()
    df_feature = pd.DataFrame(m, columns=keys)
    return df_feature


def extract_features_from_dataset(output_path: str, df: pd.DataFrame, pre_filter_config: dict, recipe_feature_config: dict) -> pd.DataFrame:
    features = []
    for idx, row in df.iterrows():
        filename = os.path.join(output_path, f"{row['id']}.parquet")

        if os.path.exists(filename):
            df_feature = pd.read_parquet(filename)
        else:
            df_feature = extract_features(row["path"], pre_filter_config, recipe_feature_config)
            df_feature["id"] = row["id"]
            df_feature["class"] = row["class"]
            df_feature.to_parquet(filename)

        features.append(df_feature)

    df_dataset = pd.concat(features, ignore_index=True)
    return df_dataset


output_path = os.path.join(os.getcwd(), "tmp", "audio_features")
os.makedirs(output_path, exist_ok=True)

train_path = os.path.join(output_path, "train")
os.makedirs(train_path, exist_ok=True)

test_path = os.path.join(output_path, "test")
os.makedirs(test_path, exist_ok=True)

df_train_features = extract_features_from_dataset(train_path, df_train, filter_config, feature_config)
df_train_features.to_parquet(os.path.join(output_path, "train.parquet"))

df_test_features = extract_features_from_dataset(test_path, df_test, filter_config, feature_config)
df_test_features.to_parquet(os.path.join(output_path, "test.parquet"))

In [14]:
from fluid.models.mixture.plot import plot_dispersion
from fluid.models.mixture.mixture_model import MixtureModel

params = {
    "Classifier": {
        "factDenSTD": 1.000000e-03,
        "distBase": "mahalanobis",
        "scoreNormMethod": "gaussian",
        "pound": 1
    }
}

mm = MixtureModel(params)

X_train = df_train_features.drop(columns=["id", "class"]).values
y_train = df_train_features["class"].values
feature_names = df_train_features.drop(columns=["id", "class"]).columns.tolist()

mm.train(X_train, y_train, feature_names)

fig = plot_dispersion(mm)
fig.show()

In [15]:
sample_id = df_test["id"].iloc[0]
df_features = df_test_features[df_test_features["id"] == sample_id]

X_test = df_features.drop(columns=["id", "class"]).values
output = mm.process(X_test)
display(output)

{'score': np.float64(73.35921550240818),
 'class': 'LEAK',
 'fit': np.float64(1.7532296917322991)}

In [16]:
import plotly.graph_objects as go

output = mm.process(X_test, return_features=True)
normalized_features = output["normalized_features"]

for frame in range(normalized_features.shape[0]):
    fig.add_trace(go.Scatter(
        name=f"Frame {frame+1}",
        marker=dict(color="black"),
        y=normalized_features[frame, :],
        hoverinfo='text'
    ))

fig.show()

In [17]:
output = mm.process(X_test, return_dists=True)

fig = go.Figure()

fig.add_trace(go.Scatter(
    name="Dist LEAK",
    marker=dict(color="red"),
    y=output["distCV"],
    hoverinfo='text'
))

fig.add_trace(go.Scatter(
    name="Dist LESS",
    marker=dict(color="blue"),
    y=output["distSV"],
    hoverinfo='text'
))

fig.show()

In [18]:
from sklearn.metrics import confusion_matrix

y_true = []
y_pred = []

for sample_id, df_features in df_test_features.groupby("id"):
    X_test = df_features.drop(columns=["id", "class"]).values
    output = mm.process(X_test)

    expected_output = df_features["class"].values[0]
    print(f"Predict {sample_id}: class {output['class']} score: {output['score']}, expected: {expected_output}")

    y_true.append(1 if expected_output == "LEAK" else 0)
    y_pred.append(1 if output["class"] == "LEAK" else 0)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel().tolist()

print(f"Confusion matrix: \nTN: {tn}, FP: {fp}, FN: {fn}, TP: {tp}")

Predict 01407475-4a35-4ed0-8f67-b759f9e18e72: class LESS score: 55.69586717507045, expected: LESS
Predict 022374f1-cd32-4a40-935f-f0ea168a5f71: class LEAK score: 64.68747610912285, expected: LEAK
Predict 057bdff0-b0b3-495d-914e-aced5d8f64c7: class LESS score: 57.1820989038919, expected: LESS
Predict 058b3360-4181-473a-8783-f0a27aa5ce08: class LEAK score: 56.85648556330576, expected: LEAK
Predict 067fb1bd-0491-4d2d-8c01-4e16e5ff4cbd: class LESS score: 57.43984198200739, expected: LESS
Predict 075f2ee2-6235-4edb-a866-4b7e93e012f2: class LESS score: 52.86195955651729, expected: LESS
Predict 09753cbd-2201-4e21-bd83-eadba3e14130: class LEAK score: 73.35921550240818, expected: LEAK
Predict 0b44b12a-1ec7-4377-aa0b-7156942c346f: class LEAK score: 56.468853675994325, expected: LEAK
Predict 0fcc33a0-9d6a-4ac6-abdf-ff0a558160d6: class LEAK score: 50.05835009618455, expected: LESS
Predict 1115a1ac-212d-44d4-9ff5-81ad5133ad6a: class LESS score: 53.102606849371504, expected: LESS
Predict 16a4a49c-7a
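
The four confusion-matrix counts can be condensed into standard summary metrics. The sketch below uses hypothetical counts, not results from this notebook, and treats LEAK as the positive class, matching the encoding above.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # of predicted leaks, how many were real
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # of real leaks, how many were found
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for illustration only.
metrics = binary_metrics(tn=40, fp=10, fn=5, tp=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For leak detection, recall is often the metric to watch, because a missed leak (false negative) is usually more costly than a false alarm.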

# Stevan - Basic Feature Extractor

In [19]:
# Audio feature extractor (32-D) + example
import numpy as np
import librosa

def extract_basic_audio_features(
    path: str,
    sr: int = 16000,
    n_fft: int = 1024,
    hop_length: int = 256,
    n_mfcc: int = 13,
) -> np.ndarray:
    """
    32-dimensional vector:
      [ RMS_mean, RMS_std, -> mean and standard deviation
        ZCR_mean, ZCR_std, -> mean and standard deviation
        Centroid_mean, Centroid_std, -> mean and standard deviation
        MFCC0_mean..MFCC12_mean (13), -> mean x13
        MFCC0_std..MFCC12_std   (13) ] -> standard deviation x13
    """
    # 1) Load the audio
    y, sr_loaded = librosa.load(path, sr=sr, mono=True)

    # 2) Magnitude spectrogram for RMS, power spectrogram for the centroid
    S_mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    S_pow = S_mag**2

    # 3) Features básicas por frame
    rms = librosa.feature.rms(S=S_mag, frame_length=n_fft, hop_length=hop_length)[0]
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=n_fft, hop_length=hop_length)[0]
    centroid = librosa.feature.spectral_centroid(S=S_pow, sr=sr_loaded)[0]
    mfcc = librosa.feature.mfcc(y=y, sr=sr_loaded, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length)

    # 4) Summary helper (mean/std) and final vector
    def stats(x: np.ndarray) -> np.ndarray:
        return np.array([np.mean(x), np.std(x)], dtype=np.float32)

    vec = np.concatenate([
        stats(rms),                  # 2
        stats(zcr),                  # 2
        stats(centroid),             # 2
        np.mean(mfcc, axis=1),       # 13
        np.std(mfcc, axis=1),        # 13
    ]).astype(np.float32)            # total = 32

    return vec

# Direct example with a file
audio_path = "/home/stevan/Documentos/4fluid-ia/data/audios/2025_db1/3026c4d9-78ff-4dde-8034-fb2c18d5100f.wav"

feat_vec = extract_basic_audio_features(audio_path)
print("Feature vector:", feat_vec.shape)   # expected: (32,)
print("First 10 dims:", np.round(feat_vec[:10], 5))

Feature vector: (32,)
First 10 dims: [ 7.8000000e-04  1.3600000e-03  1.8110999e-01  6.8719998e-02
  1.3192579e+03  5.5941486e+02 -6.3025739e+02  4.5285141e+01
 -5.1715321e+01  6.9185001e-01]
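
To read the printed values, it helps to have names aligned with the vector. The sketch below follows the concatenation order used in `extract_basic_audio_features` (RMS, ZCR, centroid, MFCC means, MFCC stds). The helper `basic_feature_names` and its naming scheme are hypothetical, introduced here only for illustration.

```python
from typing import List

def basic_feature_names(n_mfcc: int = 13) -> List[str]:
    """Names aligned with the basic extractor's output order (hypothetical helper)."""
    names = ["rms_mean", "rms_std", "zcr_mean", "zcr_std", "centroid_mean", "centroid_std"]
    names += [f"mfcc{i}_mean" for i in range(n_mfcc)]   # 13 means
    names += [f"mfcc{i}_std" for i in range(n_mfcc)]    # 13 standard deviations
    return names

names_32 = basic_feature_names()
print(len(names_32))   # 32
print(names_32[:10])
```

With this layout, the seventh value in the printout above (about -630.26) is the mean of MFCC 0, and the fifth and sixth (about 1319.26 and 559.41) are the mean and standard deviation of the spectral centroid.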


# Pipeline 2 - Feature Extraction

In [20]:
# Pipeline 2: audio feature extraction
from __future__ import annotations
from dataclasses import dataclass
from typing import Dict, Tuple, List, Union, Optional
from pathlib import Path
import numpy as np
import librosa


# -----------------------------
# Configuration
# -----------------------------
@dataclass
class FeatureConfig:
    # preprocessing
    sr: int = 16000
    mono: bool = True
    offset: float = 0.0
    duration: Optional[float] = None
    trim_silence: bool = True
    trim_top_db: float = 30.0
    normalize_peak: bool = True
    pre_emphasis: bool = True
    pre_emph_coef: float = 0.97

    # STFT
    n_fft: int = 1024
    hop_length: int = 256
    win_length: Optional[int] = None
    window: str = "hann"

    # Mel / MFCC
    n_mels: int = 64
    fmin: float = 20.0
    fmax: Optional[float] = None
    n_mfcc: int = 13
    mfcc_use_deltas: bool = True

    # Feature selection
    use_logmel: bool = True
    use_mfcc: bool = True
    use_chroma: bool = True
    use_contrast: bool = True
    use_tonnetz: bool = False        # useful for tonal (musical) signals
    use_rms: bool = True
    use_zcr: bool = True
    use_centroid_bw_rolloff: bool = True
    use_flatness: bool = True
    use_tempo: bool = True

    # Aggregation
    aggregate: bool = True
    aggregate_stats: Tuple[str, ...] = ("mean", "std", "median", "iqr", "min", "max")

    # Numerical safety
    eps: float = 1e-10


# -----------------------------
# Utility functions
# -----------------------------
def _pre_emphasis(y: np.ndarray, coef: float = 0.97) -> np.ndarray:
    if y.size < 2:
        return y
    y_out = np.empty_like(y)
    y_out[0] = y[0]
    y_out[1:] = y[1:] - coef * y[:-1]
    return y_out

def _iqr(x: np.ndarray, axis: int = -1) -> np.ndarray:
    q75 = np.nanpercentile(x, 75, axis=axis)
    q25 = np.nanpercentile(x, 25, axis=axis)
    return q75 - q25

def _aggregate_matrix(F: np.ndarray, name: str, stats: Tuple[str, ...]) -> Tuple[np.ndarray, List[str]]:
    """Aggregate (D, T) -> concatenation of per-row stats computed along axis 1 (time)."""
    F = np.atleast_2d(F)
    vec_parts, names = [], []
    for stat in stats:
        if stat == "mean":
            vals = np.nanmean(F, axis=1)
        elif stat == "std":
            vals = np.nanstd(F, axis=1)
        elif stat == "median":
            vals = np.nanmedian(F, axis=1)
        elif stat == "iqr":
            vals = _iqr(F, axis=1)
        elif stat == "min":
            vals = np.nanmin(F, axis=1)
        elif stat == "max":
            vals = np.nanmax(F, axis=1)
        else:
            raise ValueError(f"Statistic '{stat}' is not supported.")
        vec_parts.append(vals)
        names.extend([f"{name}_{i}_{stat}" for i in range(F.shape[0])])
    return np.concatenate(vec_parts, axis=0).astype(np.float32), names


# -----------------------------
# Main extractor
# -----------------------------
def extract_audio_features_advanced(
    y_or_path: Union[str, np.ndarray],
    cfg: FeatureConfig = FeatureConfig(),
    return_maps: bool = True,
) -> Tuple[Dict[str, np.ndarray], np.ndarray, List[str]]:
    """
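    Returns:
      - features_dict: 2D maps (n_feats x T) per key (empty when return_maps=False)
      - feature_vector: 1D vector of aggregated statistics
      - feature_names: names aligned with the vector
    """
    # 1) Load the audio (accepts a file path or an already-loaded waveform)
    if isinstance(y_or_path, (str, Path)):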
        y, sr_loaded = librosa.load(
            y_or_path, sr=cfg.sr, mono=cfg.mono, offset=cfg.offset, duration=cfg.duration
        )
    else:
        y = np.asarray(y_or_path, dtype=np.float32)
        sr_loaded = cfg.sr if cfg.sr is not None else 22050
        if cfg.mono and y.ndim > 1:
            y = librosa.to_mono(y)

    # 2) Trim silence, peak-normalize, and apply pre-emphasis
    if cfg.trim_silence:
        y, _ = librosa.effects.trim(y, top_db=cfg.trim_top_db)
    if cfg.normalize_peak and np.max(np.abs(y)) > 0:
        y = y / (np.max(np.abs(y)) + cfg.eps)
    if cfg.pre_emphasis:
        y = _pre_emphasis(y, coef=cfg.pre_emph_coef)

    # 3) STFT
    win_length = cfg.win_length or cfg.n_fft
    S_complex = librosa.stft(
        y, n_fft=cfg.n_fft, hop_length=cfg.hop_length, win_length=win_length, window=cfg.window
    )
    S_mag = np.abs(S_complex)
    S_pow = (S_mag ** 2)
    features_dict: Dict[str, np.ndarray] = {}

    # 4) Log-Mel
    if cfg.use_logmel:
        mel = librosa.feature.melspectrogram(
            S=S_pow, sr=sr_loaded, n_mels=cfg.n_mels, fmin=cfg.fmin, fmax=cfg.fmax
        )
        logmel = np.log(mel + cfg.eps)
        features_dict["logmel"] = logmel

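    # 5) MFCC
    # Note: computed from the waveform with librosa's default mel settings,
    # so cfg.n_mels / cfg.fmin / cfg.fmax do not apply to this branch.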
    if cfg.use_mfcc:
        mfcc = librosa.feature.mfcc(
            y=y, sr=sr_loaded, n_mfcc=cfg.n_mfcc, n_fft=cfg.n_fft, hop_length=cfg.hop_length
        )
        if cfg.mfcc_use_deltas:
            d1 = librosa.feature.delta(mfcc, order=1)
            d2 = librosa.feature.delta(mfcc, order=2)
            mfcc = np.vstack([mfcc, d1, d2])  # shape: (n_mfcc*3, T)
        features_dict["mfcc"] = mfcc

    # 6) Chroma / Contrast / Tonnetz (other librosa options, mostly music-oriented)
    if cfg.use_chroma:
        features_dict["chroma"] = librosa.feature.chroma_stft(S=S_pow, sr=sr_loaded, hop_length=cfg.hop_length)
    if cfg.use_contrast:
        features_dict["contrast"] = librosa.feature.spectral_contrast(S=S_pow, sr=sr_loaded, fmin=cfg.fmin)
    if cfg.use_tonnetz:
        y_h = librosa.effects.harmonic(y)
        features_dict["tonnetz"] = librosa.feature.tonnetz(y=y_h, sr=sr_loaded)

    # 7) Other features
    if cfg.use_rms:
        features_dict["rms"] = librosa.feature.rms(S=S_mag, frame_length=cfg.n_fft, hop_length=cfg.hop_length)
    if cfg.use_zcr:
        features_dict["zcr"] = librosa.feature.zero_crossing_rate(y, frame_length=win_length, hop_length=cfg.hop_length)
    if cfg.use_centroid_bw_rolloff:
        features_dict["centroid"]  = librosa.feature.spectral_centroid(S=S_pow, sr=sr_loaded)
        features_dict["bandwidth"] = librosa.feature.spectral_bandwidth(S=S_pow, sr=sr_loaded)
        features_dict["rolloff"]   = librosa.feature.spectral_rolloff(S=S_pow, sr=sr_loaded)
    if cfg.use_flatness:
        features_dict["flatness"] = librosa.feature.spectral_flatness(S=S_pow)

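    # 8) Tempo (BPM)
    if cfg.use_tempo:
        try:
            tempo, _ = librosa.beat.beat_track(y=y, sr=sr_loaded, hop_length=cfg.hop_length)
            # Recent librosa versions return tempo as a 1-element array; take the
            # element explicitly, since float() on an array is deprecated in NumPy 1.25+.
            tempo = float(np.atleast_1d(tempo)[0])
        except Exception:
            tempo = np.nan
        # Broadcast the scalar tempo to (1, T) so it aggregates like the other maps
        T = next(iter(features_dict.values())).shape[1] if features_dict else S_mag.shape[1]
        features_dict["tempo"] = np.full((1, T), tempo, dtype=np.float32)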

    # 9) Aggregation -> fixed-length vector
    feat_vec_list, feat_names = [], []
    if cfg.aggregate and features_dict:
        for k, v in features_dict.items():
            vec_k, names_k = _aggregate_matrix(v, k, cfg.aggregate_stats)
            feat_vec_list.append(vec_k)
            feat_names.extend(names_k)
        feature_vector = np.concatenate(feat_vec_list, axis=0).astype(np.float32)
    else:
        feature_vector = np.array([], dtype=np.float32)
        feat_names = []

    # 10) Return
    if not return_maps:
        # When maps are not requested, return an empty dict plus the vector
        return {}, feature_vector, feat_names
    return features_dict, feature_vector, feat_names


# -----------------------------
# DataFrame conversion + standardization
# -----------------------------
# Optional dependencies: the helpers below check for None before using them.
try:
    import pandas as pd
except ImportError:
    pd = None
try:
    from sklearn.preprocessing import StandardScaler
except ImportError:
    StandardScaler = None

def features_to_dataframe(feature_vector: np.ndarray, feature_names: List[str], tag: str = "sample_0"):
    if pd is None:
        raise ImportError("pandas is not available. Install `pandas` to use this helper.")
    df = pd.DataFrame([feature_vector], index=[tag], columns=feature_names)
    return df

def standardize_features(X: np.ndarray) -> Tuple[np.ndarray, Optional[StandardScaler]]:
    if StandardScaler is None:
        print("⚠️ scikit-learn is not available. Returning X without standardization.")
        return X, None
    scaler = StandardScaler()
    Xs = scaler.fit_transform(X)
    return Xs, scaler

print("Pipeline ready: use `extract_audio_features_advanced(path_or_wave, cfg)`")


Pipeline ready: use `extract_audio_features_advanced(path_or_wave, cfg)`
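
`_aggregate_matrix` lays out its output statistic by statistic: first one value per row for the first statistic, then one value per row for the next, and so on. The sketch below mirrors that loop on a toy (2, 3) matrix with only two statistics, so the ordering of values and names is easy to see. The `toy_*` names and the `reducers` lookup table are illustrative and do not appear in the pipeline.

```python
import numpy as np

# Toy (D=2, T=3) feature map: two feature rows, three time frames
F = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 60.0]], dtype=np.float32)

stats = ("mean", "max")
reducers = {"mean": np.nanmean, "max": np.nanmax}

parts, toy_names = [], []
for stat in stats:
    parts.append(reducers[stat](F, axis=1))                      # one value per row
    toy_names.extend(f"toy_{i}_{stat}" for i in range(F.shape[0]))

toy_vec = np.concatenate(parts).astype(np.float32)
print(dict(zip(toy_names, toy_vec.tolist())))
```

This stat-major ordering means that the first names of a real feature vector are all `*_mean` entries of the first feature map.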


In [21]:
# Using the pipeline
from pathlib import Path
from IPython.display import Audio, display

audio_path = "/home/stevan/Documentos/4fluid-ia/data/audios/2025_db1/3026c4d9-78ff-4dde-8034-fb2c18d5100f.wav"
cfg = FeatureConfig(
    sr=16000,
    n_fft=1024,
    hop_length=256,
    n_mels=64,
    n_mfcc=13,
    mfcc_use_deltas=True,      # append delta and delta-delta MFCCs
    use_logmel=True,
    use_mfcc=True,
    use_chroma=True,
    use_contrast=True,
    use_tonnetz=False,         # enable if the signal is musical
    use_rms=True,
    use_zcr=True,
    use_centroid_bw_rolloff=True,
    use_flatness=True,
    use_tempo=True,
    aggregate=True,
    aggregate_stats=("mean","std","median","iqr","min","max"),
)

maps, vec, names = extract_audio_features_advanced(audio_path, cfg, return_maps=True)
print(f"✅ aggregated vector: {vec.shape} dimensions")
print("some names:", names[:10])
print("available maps:", {k: v.shape for k, v in maps.items()})

if Path(audio_path).exists():
    display(Audio(audio_path))


✅ aggregated vector: (774,) dimensions
some names: ['logmel_0_mean', 'logmel_1_mean', 'logmel_2_mean', 'logmel_3_mean', 'logmel_4_mean', 'logmel_5_mean', 'logmel_6_mean', 'logmel_7_mean', 'logmel_8_mean', 'logmel_9_mean']
available maps: {'logmel': (64, 622), 'mfcc': (39, 622), 'chroma': (12, 622), 'contrast': (7, 622), 'rms': (1, 622), 'zcr': (1, 622), 'centroid': (1, 622), 'bandwidth': (1, 622), 'rolloff': (1, 622), 'flatness': (1, 622), 'tempo': (1, 622)}
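
The 774 dimensions follow directly from the map shapes: every row of every map contributes one value per aggregation statistic. The short check below redoes that arithmetic with the shapes copied from the output above. The duration estimate assumes librosa's default centered STFT framing, where the frame count is `1 + len(y) // hop_length`, so it is only a lower bound on the length of the trimmed signal.

```python
# Map shapes copied from the output above: name -> (rows, frames)
map_shapes = {
    "logmel": (64, 622), "mfcc": (39, 622), "chroma": (12, 622), "contrast": (7, 622),
    "rms": (1, 622), "zcr": (1, 622), "centroid": (1, 622), "bandwidth": (1, 622),
    "rolloff": (1, 622), "flatness": (1, 622), "tempo": (1, 622),
}
n_stats = 6  # mean, std, median, iqr, min, max

total_rows = sum(rows for rows, _ in map_shapes.values())
vector_length = total_rows * n_stats
print(total_rows, vector_length)   # 129 774

# Lower bound on the trimmed signal duration implied by 622 frames
hop_length, sr, n_frames = 256, 16000, 622
min_seconds = (n_frames - 1) * hop_length / sr
print(round(min_seconds, 2))       # 9.94
```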




# Extracting Vectors for df_train and df_test

In [22]:
# Extract feature vectors from the dataframes
from pathlib import Path
import numpy as np
from sklearn.preprocessing import LabelEncoder

# 1) Wrapper: runs the ADVANCED extractor and returns ONLY the aggregated vector
def feature_fn_advanced_only_vector(path: str) -> np.ndarray:
    cfg = FeatureConfig(
        sr=16000, n_fft=1024, hop_length=256,
        n_mels=64, n_mfcc=13, mfcc_use_deltas=True,
        use_logmel=True, use_mfcc=True, use_chroma=True, use_contrast=True,
        use_tonnetz=False, use_rms=True, use_zcr=True,
        use_centroid_bw_rolloff=True, use_flatness=True, use_tempo=True,
        aggregate=True,
        aggregate_stats=("mean","std","median","iqr","min","max"),
    )
    _, vec, _ = extract_audio_features_advanced(path, cfg, return_maps=False)
    return vec.astype(np.float32)

def batch_extract_vectors(df, path_col="path"):
    X_list, keep_idx = [], []
    for i, (idx, row) in enumerate(df.iterrows(), start=1):
        p = str(row[path_col])
        if not Path(p).exists():
            print(f"⚠️ file not found, skipping: {p}")
            continue
        try:
            v = feature_fn_advanced_only_vector(p)
            X_list.append(v)
            keep_idx.append(idx)
        except Exception as e:
            print(f"⚠️ failed to extract features from {p}: {e}")
        if i % 50 == 0:
            print(f"... processed {i}/{len(df)}")
    # Keep X two-dimensional even when no file could be processed
    X = np.vstack(X_list).astype(np.float32) if X_list else np.zeros((0, 0), dtype=np.float32)
    return X, df.loc[keep_idx].reset_index(drop=True)

# Batch extraction
X_train, df_train_ok = batch_extract_vectors(df_train, path_col="path")
X_test,  df_test_ok  = batch_extract_vectors(df_test,  path_col="path")

# Labels (encoded as integers)
le = LabelEncoder()
y_train = le.fit_transform(df_train_ok["class"].values)
y_test  = le.transform(df_test_ok["class"].values)

print(f"✅ X_train: {X_train.shape} | y_train: {y_train.shape} | classes: {list(le.classes_)}")
print(f"✅ X_test:  {X_test.shape}  | y_test:  {y_test.shape}")

... processed 50/640
... processed 100/640
... processed 150/640
... processed 200/640
... processed 250/640
... processed 300/640
... processed 350/640
... processed 400/640
... processed 450/640
... processed 500/640
... processed 550/640
... processed 600/640
... processed 50/192
... processed 100/192
... processed 150/192
✅ X_train: (640, 774) | y_train: (640,) | classes: ['LEAK', 'LESS']
✅ X_test:  (192, 774)  | y_test:  (192,)
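
Because the extractor falls back to `tempo = np.nan` when beat tracking fails, the feature matrices may contain non-finite entries, and scikit-learn's `LogisticRegression` rejects NaN inputs. The sketch below shows one way to guard against that before training. The helper `impute_non_finite` and the choice of median imputation are assumptions made for illustration; they are not part of the original pipeline.

```python
from typing import Optional, Tuple
import numpy as np

def impute_non_finite(X: np.ndarray, fill: Optional[np.ndarray] = None) -> Tuple[np.ndarray, np.ndarray]:
    """Replace NaN/inf entries column-wise; returns (clean X, fill values used)."""
    X = np.array(X, dtype=np.float32, copy=True)
    bad = ~np.isfinite(X)
    if fill is None:
        # Column medians computed over the finite entries only
        masked = np.where(bad, np.nan, X)
        fill = np.nanmedian(masked, axis=0)
        fill = np.where(np.isfinite(fill), fill, 0.0)  # column with no finite value
    X[bad] = np.broadcast_to(fill, X.shape)[bad]
    return X, fill

# Toy matrix with one NaN and one inf
X_demo = np.array([[1.0, np.nan], [3.0, 4.0], [np.inf, 6.0]], dtype=np.float32)
X_clean, fill_values = impute_non_finite(X_demo)
print(X_clean)
```

To avoid leaking test statistics, the fill values would be estimated on the training matrix and reused for the test matrix, for example `X_train_c, fill = impute_non_finite(X_train)` followed by `X_test_c, _ = impute_non_finite(X_test, fill)`.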


# Train, evaluate, and save the classifier -> Logistic Regression

In [None]:
# Classifier: StandardScaler + Logistic Regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import joblib

# Training
clf = Pipeline([
    ("scaler", StandardScaler(with_mean=True, with_std=True)),
    ("logreg", LogisticRegression(
        max_iter=2000, class_weight="balanced", n_jobs=-1, solver="lbfgs", random_state=42
    )),
])
clf.fit(X_train, y_train)

# Evaluation on the TEST set already defined by the DVC split
y_pred = clf.predict(X_test)
print("=== Classification Report (TEST) ===")
print(classification_report(y_test, y_pred, target_names=list(le.classes_), digits=4))

# Confusion matrix
labels_sorted = list(range(len(le.classes_)))
cm = confusion_matrix(y_test, y_pred, labels=labels_sorted)

fig, ax = plt.subplots(figsize=(5,4))
im = ax.imshow(cm, cmap="Blues")
ax.set_xticks(labels_sorted); ax.set_xticklabels(le.classes_, rotation=45, ha="right")
ax.set_yticks(labels_sorted); ax.set_yticklabels(le.classes_)
ax.set_xlabel("Predicted"); ax.set_ylabel("True"); ax.set_title("Confusion matrix")
for (i, j), v in np.ndenumerate(cm):
    ax.text(j, i, str(v), ha="center", va="center")
plt.colorbar(im); plt.tight_layout(); plt.show()

# Save the pipeline + label encoder
MODEL_PATH = "audio_classifier_logreg.joblib"
joblib.dump({"pipeline": clf, "label_encoder": le}, MODEL_PATH)
print(f"✅ Model saved to: {MODEL_PATH}")

# Predict a single new file with the same pipeline
def predict_single_audio(path: str):
    v = feature_fn_advanced_only_vector(path).reshape(1, -1)
    y_hat = clf.predict(v)[0]
    prob = clf.predict_proba(v)[0]
    return le.inverse_transform([y_hat])[0], dict(zip(le.classes_, prob))

=== Classification Report (TEST) ===
              precision    recall  f1-score   support

        LEAK     0.8936    0.8750    0.8842        96
        LESS     0.8776    0.8958    0.8866        96

    accuracy                         0.8854       192
   macro avg     0.8856    0.8854    0.8854       192
weighted avg     0.8856    0.8854    0.8854       192

✅ Model saved to: audio_classifier_logreg.joblib
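
The confusion-matrix figure is not rendered in this export, but the per-class counts can be reconstructed from the recall and support columns of the report, and the remaining metrics then serve as a consistency check. The sketch below does that arithmetic. It is a reconstruction from the rounded values printed above, not output produced by the notebook.

```python
# Values copied from the classification report above
support = {"LEAK": 96, "LESS": 96}
recall = {"LEAK": 0.8750, "LESS": 0.8958}

# Correct predictions per class = recall * support, rounded to whole samples
correct = {c: round(recall[c] * support[c]) for c in support}   # LEAK: 84, LESS: 86
missed = {c: support[c] - correct[c] for c in support}          # LEAK: 12, LESS: 10

# In a binary problem, a missed LEAK was predicted as LESS and vice versa
precision_leak = correct["LEAK"] / (correct["LEAK"] + missed["LESS"])
precision_less = correct["LESS"] / (correct["LESS"] + missed["LEAK"])
accuracy = sum(correct.values()) / sum(support.values())

print(correct, missed)
print(round(precision_leak, 4), round(precision_less, 4), round(accuracy, 4))
# 0.8936 0.8776 0.8854
```

The recomputed precision and accuracy match the report, so the test set contains 12 leaks classified as `LESS` and 10 non-leak samples classified as `LEAK`.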




In [None]:
# Example:
pred_label, pred_proba = predict_single_audio(df_test_ok.loc[0, "path"])
pred_label, pred_proba



('LEAK',
 {'LEAK': np.float64(0.9999748344746593),
  'LESS': np.float64(2.5165525340675957e-05)})


In [27]:
# Inspect the mapping:
print("Classes:", list(le.classes_))
print("Class->id map:", dict(zip(le.classes_, le.transform(le.classes_))))


Classes: ['LEAK', 'LESS']
Class->id map: {'LEAK': np.int64(0), 'LESS': np.int64(1)}
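
Note that `LabelEncoder` sorts class names alphabetically, so `LEAK` is encoded as 0 here, whereas the earlier evaluation cell encoded `LEAK` as 1 (the positive class). The sketch below shows one way to bring encoded labels back to the "1 means LEAK" convention before computing TN/FP/FN/TP. The helper `to_leak_positive` is hypothetical and the demo labels are illustrative.

```python
import numpy as np

classes = ["LEAK", "LESS"]          # le.classes_ as printed above
leak_id = classes.index("LEAK")     # 0

def to_leak_positive(y_encoded, leak_id: int = 0) -> np.ndarray:
    """Map LabelEncoder ids to a binary vector where 1 means LEAK."""
    return (np.asarray(y_encoded) == leak_id).astype(int)

y_demo = np.array([0, 1, 1, 0])     # illustrative encoded labels
print(to_leak_positive(y_demo, leak_id))   # [1 0 0 1]
```

Applying the same mapping to both `y_test` and `y_pred` and then calling `confusion_matrix(..., labels=[0, 1]).ravel()` would yield counts with `LEAK` as the positive class, comparable with the earlier cell.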


In [29]:
# Quick check:
print(f"✅ X_train: {X_train.shape} | y_train: {y_train.shape} | classes: {list(le.classes_)}")
print(f"✅ X_test:  {X_test.shape}  | y_test:  {y_test.shape}")

✅ X_train: (640, 774) | y_train: (640,) | classes: ['LEAK', 'LESS']
✅ X_test:  (192, 774)  | y_test:  (192,)
