# FAD Evaluation of RAVE Model

This notebook provides an **objective evaluation pipeline** for audio generated by the **RAVE model** using the **Frechet Audio Distance (FAD) toolkit**.  
It quantifies the similarity between RAVE-generated audio and reference data from the training set, enabling a reproducible, quantitative assessment of model performance.

---

## Features
- Generate **505 audio files** with the Max/MSP RAVE patcher.  
- Extract generated outputs (`rave_outputs.zip`) and training dataset (`wav_files_5h.zip`).  
- Build a **reference set** from the original training data.  
- Standardize all audio formatting:
  - Trim to **4 seconds**
  - Resample to **44.1 kHz**
  - Convert to **mono, 16-bit PCM**  
- Compute **FAD scores** with Microsoft’s [FAD toolkit](https://arxiv.org/abs/2311.01616).  
- Compare RAVE outputs against the training dataset baseline.  

---

## Dependencies
- Python 3.x  
- [PyTorch](https://pytorch.org/)  
- [NumPy](https://numpy.org/)  
- [FFmpeg](https://ffmpeg.org/) (for preprocessing audio files)  
- [FAD Toolkit](https://github.com/microsoft/fadtk)  

Install requirements:
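```bash
pip install torch torchvision fadtk ffmpeg-python
```

Note that `ffmpeg-python` is only a Python wrapper; the `ffmpeg` command-line binary used in the preprocessing cells must be installed separately (for example through conda or a system package manager).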

## Evaluation Procedure

### Generate Files
A total of **505 audio files** are generated using the **Max/MSP patcher** provided in the [freesound-loop-generator repository](https://github.com/AdaSalvadorAvalos/freesound-loop-generator/blob/main/inference/inference_simple.maxpat) with the [pretrained model](https://github.com/AdaSalvadorAvalos/freesound-loop-generator/tree/main/model_checkpoint).


In [5]:
!unzip rave_outputs.zip

Archive:  rave_outputs.zip
   creating: rave_outputs/
  inflating: rave_outputs/49_22_07.wav  
  inflating: __MACOSX/rave_outputs/._49_22_07.wav  
  inflating: rave_outputs/39_26_04.wav  
  inflating: __MACOSX/rave_outputs/._39_26_04.wav  
  inflating: rave_outputs/39_28_01.wav  
  inflating: __MACOSX/rave_outputs/._39_28_01.wav  
  inflating: rave_outputs/28_21_02.wav  
  inflating: __MACOSX/rave_outputs/._28_21_02.wav  
  inflating: rave_outputs/26_10_01.wav  
  inflating: __MACOSX/rave_outputs/._26_10_01.wav  
  inflating: rave_outputs/21_51_05.wav  
  inflating: __MACOSX/rave_outputs/._21_51_05.wav  
  inflating: rave_outputs/49_51_02.wav  
  inflating: __MACOSX/rave_outputs/._49_51_02.wav  
  inflating: rave_outputs/33_22_07.wav  
  inflating: __MACOSX/rave_outputs/._33_22_07.wav  
  inflating: rave_outputs/39_10_05.wav  
  inflating: __MACOSX/rave_outputs/._39_10_05.wav  
  inflating: rave_outputs/33_51_02.wav  
  inflating: __MACOSX/rave_outputs/._33_51_02.wav  
  inflating: rav

In [8]:
!unzip wav_files_5h.zip

Archive:  wav_files_5h.zip
   creating: wav_files_5h/
  inflating: wav_files_5h/4183027052167_processed.wav  
  inflating: wav_files_5h/19900834300_processed.wav  
  inflating: wav_files_5h/1219871067227_processed.wav  
  inflating: wav_files_5h/21052109347_processed.wav  
  inflating: wav_files_5h/58662518_processed.wav  
  inflating: wav_files_5h/2326122518_processed.wav  
  inflating: wav_files_5h/3591465968459_processed.wav  
  inflating: wav_files_5h/2051845941_processed.wav  
  inflating: wav_files_5h/128672600982_processed.wav  
  inflating: wav_files_5h/4773159497060_processed.wav  
  inflating: wav_files_5h/4772899497060_processed.wav  
  inflating: wav_files_5h/19921534300_processed.wav  
  inflating: wav_files_5h/1959481951924_processed.wav  
  inflating: wav_files_5h/2129251676145_processed.wav  
  inflating: wav_files_5h/1492034083_processed.wav  
  inflating: wav_files_5h/241309421331_processed.wav  
  inflating: wav_files_5h/18362597763_processed.wav  
  inflating: wav_f

In [7]:
!find rave_outputs -type f -name "*.wav" | wc -l

     505
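
The same count can be reproduced from Python, which is convenient when the check should run inside the notebook rather than through the shell. The sketch below is illustrative: the helper name `count_wav_files` is an assumption, and it deliberately skips macOS `._` resource-fork entries like the ones visible in the `__MACOSX/` folder of the unzip log.

```python
from pathlib import Path


def count_wav_files(folder):
    """Count .wav files under `folder`, ignoring macOS '._' resource-fork entries."""
    return sum(
        1
        for path in Path(folder).rglob("*.wav")
        if path.is_file() and not path.name.startswith("._")
    )
```

Calling `count_wav_files("rave_outputs")` should agree with the `find ... | wc -l` result above as long as the folder holds only the extracted audio.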


### Create Reference Set
The **training dataset** used to train the RAVE model serves as the reference set.  
All samples are verified for consistent processing and correct alignment.  
Each file is trimmed to **4 seconds** and checked for uniform audio settings:  
- Sampling rate: **44,100 Hz**  
- Channels: **Mono**  
- Bit depth: **16-bit** 
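
Inspecting files one at a time with `ffmpeg -i` does not scale to a full folder, so a programmatic check can help. The following is a minimal sketch using Python's standard `wave` module, not the procedure used in this notebook; the helper names `read_wav_format` and `find_mismatches` are assumptions, and `wave` only reads PCM WAV files.

```python
import wave
from pathlib import Path

# Target format taken from the list above (2 bytes per sample = 16-bit).
EXPECTED = {"sample_rate": 44100, "channels": 1, "sample_width_bytes": 2}


def read_wav_format(path):
    """Return sample rate, channel count, sample width, and duration of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": frames / rate,
        }


def find_mismatches(folder, expected=EXPECTED):
    """Map each non-conforming WAV file name to the fields that differ from `expected`."""
    mismatches = {}
    for path in sorted(Path(folder).glob("*.wav")):
        info = read_wav_format(path)
        wrong = {key: info[key] for key, value in expected.items() if info[key] != value}
        if wrong:
            mismatches[path.name] = wrong
    return mismatches
```

An empty dictionary from `find_mismatches("wav_files_5h")` would indicate that every file in the folder already matches the target format.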

In [10]:
!find wav_files_5h/ -type f -name "*.wav" | wc -l

    1125


In [2]:
!ffmpeg -i rave_outputs/10_51_05.wav

ffmpeg version 6.1.1 Copyright (c) 2000-2023 the FFmpeg developers
  built with clang version 14.0.6
  configuration: --prefix=/Volumes/CrucialX9/conda_envs/eval_fad --cc=x86_64-apple-darwin13.4.0-clang --ar=x86_64-apple-darwin13.4.0-ar --nm=x86_64-apple-darwin13.4.0-nm --ranlib=x86_64-apple-darwin13.4.0-ranlib --strip=x86_64-apple-darwin13.4.0-strip --disable-doc --enable-swresample --enable-swscale --enable-openssl --enable-libxml2 --enable-libtheora --enable-demuxer=dash --enable-postproc --enable-hardcoded-tables --enable-libfreetype --enable-libharfbuzz --enable-libfontconfig --enable-libdav1d --enable-zlib --enable-libaom --enable-pic --enable-shared --disable-static --disable-gpl --enable-version3 --disable-sdl2 --enable-libopenh264 --enable-libopus --enable-libmp3lame --enable-libopenjpeg --enable-libvorbis --enable-pthreads --enable-libtesseract --enable-libvpx --enable-librsvg
  libavutil      58. 29.100 / 58. 29.100
  libavcodec     60. 31.102 / 60. 31.102
  libavformat    6

In [3]:
!ffmpeg -i wav_files_5h/2205351_processed.wav

ffmpeg version 6.1.1 Copyright (c) 2000-2023 the FFmpeg developers
  built with clang version 14.0.6
  configuration: --prefix=/Volumes/CrucialX9/conda_envs/eval_fad --cc=x86_64-apple-darwin13.4.0-clang --ar=x86_64-apple-darwin13.4.0-ar --nm=x86_64-apple-darwin13.4.0-nm --ranlib=x86_64-apple-darwin13.4.0-ranlib --strip=x86_64-apple-darwin13.4.0-strip --disable-doc --enable-swresample --enable-swscale --enable-openssl --enable-libxml2 --enable-libtheora --enable-demuxer=dash --enable-postproc --enable-hardcoded-tables --enable-libfreetype --enable-libharfbuzz --enable-libfontconfig --enable-libdav1d --enable-zlib --enable-libaom --enable-pic --enable-shared --disable-static --disable-gpl --enable-version3 --disable-sdl2 --enable-libopenh264 --enable-libopus --enable-libmp3lame --enable-libopenjpeg --enable-libvorbis --enable-pthreads --enable-libtesseract --enable-libvpx --enable-librsvg
  libavutil      58. 29.100 / 58. 29.100
  libavcodec     60. 31.102 / 60. 31.102
  libavformat    6

In [5]:
%%bash
mkdir -p trimmed

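# Trim every reference file to its first 4 seconds; -y overwrites existing outputs.
for f in wav_files_5h/*.wav; do
  fname=$(basename "$f")
  # -t 4.0 caps the output duration; no resampling or channel options are passed.
  ffmpeg -y -hide_banner -loglevel error -i "$f" -t 4.0 "trimmed/$fname"
done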
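
The loop above only trims. If the source files were not already at the target format, the conversion could be made explicit in the same ffmpeg call. The sketch below is a hypothetical variant, not what produced the scores reported in this notebook; the function names are assumptions, while `-t`, `-ar`, `-ac`, and `-c:a pcm_s16le` are standard ffmpeg options.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(src, dst, duration_s=4.0, sample_rate=44100):
    """Build an ffmpeg call that trims a file and converts it to mono 16-bit PCM."""
    return [
        "ffmpeg", "-y", "-hide_banner", "-loglevel", "error",
        "-i", str(src),
        "-t", str(duration_s),     # keep only the first `duration_s` seconds
        "-ar", str(sample_rate),   # resample to the target rate
        "-ac", "1",                # downmix to a single channel
        "-c:a", "pcm_s16le",       # encode as 16-bit little-endian PCM
        str(dst),
    ]


def standardize_folder(src_dir, dst_dir):
    """Run the command for every WAV in `src_dir`; requires ffmpeg on PATH."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(src_dir).glob("*.wav")):
        subprocess.run(build_ffmpeg_command(src, dst / src.name), check=True)
```

Building the argument list separately from running it keeps the command easy to inspect and avoids shell quoting issues with unusual file names.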



### Frechet Audio Evaluation

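The FAD toolkit is run with the trimmed reference set (`trimmed/`) as the baseline and the folder of generated audio (`rave_outputs/`) as the evaluation set, yielding one score per embedding model. Lower scores indicate generated audio that is statistically closer to the reference.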
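
For context, the Fréchet distance underlying FAD compares two multivariate Gaussians fitted to the embeddings of each set. In the standard formulation, with symbols chosen here for illustration ($\mu_r, \Sigma_r$ for the mean and covariance of the reference embeddings, $\mu_g, \Sigma_g$ for the generated ones), it reads:

$$
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
$$

The first term measures how far apart the two embedding clouds are on average, and the trace term measures how much their spread differs. Identical distributions give a distance of zero.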

In [None]:
!pip install torch torchvision fadtk

In [1]:
!fadtk

usage: fadtk [-h] [-w WORKERS] [-s SOX_PATH] [--inf] [--indiv]
             {clap-2023,clap-laion-audio,clap-laion-music,vggish,MERT-v1-95M-1,MERT-v1-95M-2,MERT-v1-95M-3,MERT-v1-95M-4,MERT-v1-95M-5,MERT-v1-95M-6,MERT-v1-95M-7,MERT-v1-95M-8,MERT-v1-95M-9,MERT-v1-95M-10,MERT-v1-95M-11,MERT-v1-95M,encodec-emb,encodec-emb-48k,w2v2-base-1,w2v2-base-2,w2v2-base-3,w2v2-base-4,w2v2-base-5,w2v2-base-6,w2v2-base-7,w2v2-base-8,w2v2-base-9,w2v2-base-10,w2v2-base-11,w2v2-base,w2v2-large-1,w2v2-large-2,w2v2-large-3,w2v2-large-4,w2v2-large-5,w2v2-large-6,w2v2-large-7,w2v2-large-8,w2v2-large-9,w2v2-large-10,w2v2-large-11,w2v2-large-12,w2v2-large-13,w2v2-large-14,w2v2-large-15,w2v2-large-16,w2v2-large-17,w2v2-large-18,w2v2-large-19,w2v2-large-20,w2v2-large-21,w2v2-large-22,w2v2-large-23,w2v2-large,hubert-base-1,hubert-base-2,hubert-base-3,hubert-base-4,hubert-base-5,hubert-base-6,hubert-base-7,hubert-base-8,hubert-base-9,hubert-base-10,hubert-base-11,hubert-base,hubert-large-1,hubert-large-2,hubert-lar

In [None]:
!fadtk clap-laion-music trimmed/ rave_outputs/ --inf

[Frechet Audio Distance] Loading 1125 audio files...
Loading HTSAT-base model config.
Loading HTSAT-base model config.
Loading HTSAT-base model config.
Loading HTSAT-base model config.
Loading HTSAT-base model config.
Loading HTSAT-base model config.
Loading HTSAT-base model config.
Loading HTSAT-base model config.
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initializ

The FAD clap-laion-music score between trimmed/ and rave_outputs/ is: 0.3591982714843721


In [None]:
!fadtk clap-laion-audio trimmed/ rave_outputs/ --inf

The FAD clap-laion-audio score between trimmed/ and rave_outputs/ is: 0.3631657854401499    

In [None]:
!fadtk encodec-emb trimmed/ rave_outputs/ --inf

The FAD encodec-emb score between trimmed/ and rave_outputs/ is: 102.95102670897386  

In [None]:
!fadtk MERT-v1-95M-4 trimmed/ rave_outputs/ --inf

The FAD MERT-v1-95M-4 score between trimmed/ and rave_outputs/ is: 5.876220112794796   
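
To compare runs, the score lines printed by `fadtk` can be gathered into one structure. The sketch below assumes the line format shown in the outputs above; the regular expression and the helper name `parse_fad_scores` are illustrative, not part of the toolkit.

```python
import re

# Matches lines such as:
# "The FAD clap-laion-music score between trimmed/ and rave_outputs/ is: 0.359..."
SCORE_LINE = re.compile(
    r"The FAD (?P<model>\S+) score between (?P<baseline>\S+) "
    r"and (?P<evaluated>\S+) is: (?P<score>[-+0-9.eE]+)"
)


def parse_fad_scores(log_text):
    """Return a {model_name: score} dict for every score line found in `log_text`."""
    return {
        match.group("model"): float(match.group("score"))
        for match in SCORE_LINE.finditer(log_text)
    }
```

Each embedding model defines its own feature space, so the absolute values (for example roughly 0.36 for the CLAP models versus roughly 103 for `encodec-emb`) are best compared within a model across different systems rather than across models.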