# FAD Evaluation of the CMT Model

This notebook provides an **objective evaluation pipeline** for audio generated by the **Conditioned Morpher Transformer (CMT)** using the **Fréchet Audio Distance (FAD) toolkit**.  
It measures how close the generated morphed audio is to a reference dataset, giving a quantitative assessment of model performance.

---

## Features
- Generate morphed audio files using the inference script in gradient mode.  
- Clean up intermediate `.dac` files after decoding.  
- Aggregate and organize generated files for evaluation.  
- Standardize audio formatting:
  - Trim to **4 seconds**
  - Resample to **44.1 kHz**
  - Convert to **mono, 16-bit PCM**  
- Compute **FAD scores** with Microsoft’s [FAD toolkit](https://arxiv.org/abs/2311.01616).  
- Summarize results across experimental conditions.

---

## Dependencies
- Python 3.x  
- [PyTorch](https://pytorch.org/)  
- [NumPy](https://numpy.org/)  
- [FFmpeg](https://ffmpeg.org/) (for audio preprocessing)  
- [FAD Toolkit](https://github.com/microsoft/fadtk)  

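Install the Python requirements (the `ffmpeg` executable itself must be installed separately, since `ffmpeg-python` is only a wrapper around it):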
```bash
pip install torch torchvision fadtk ffmpeg-python
```

---


## Evaluation Procedure

### Generate Files
The inference script is used in **gradient mode** to generate morphed files for morph ratios from `0.1` to `0.9` for a total of `X` source–target pairs.
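
As a small illustration of the sweep, the nine morph ratios can be enumerated as below. This is only a sketch: the helper names are hypothetical, and the directory pattern (`0.1` mapping to `eval_0_1`) is an assumption inferred from the `eval_0_1` path that appears in the cleanup step. The inference script's own invocation is not reproduced here.

```python
def morph_ratios(first=1, last=9):
    """Return the morph ratios 0.1 .. 0.9, built from integers to avoid drift."""
    return [i / 10 for i in range(first, last + 1)]


def eval_dir_name(ratio):
    # Assumed naming scheme: 0.1 -> "eval_0_1" (inferred, not confirmed).
    return "eval_" + f"{ratio:.1f}".replace(".", "_")


for ratio in morph_ratios():
    print(f"{ratio:.1f} -> {eval_dir_name(ratio)}")
```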

### Clean Up Intermediate Files
Intermediate `.dac` encoded files are removed after generation:

In [4]:
!rm ../eval_0_1/decoded/*.dac
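
The cell above cleans a single output directory. A possible Python equivalent that covers every `eval_*/decoded` folder, with a dry-run default so nothing is deleted by accident, is sketched here. The function name is hypothetical, and it assumes the other gradient steps use the same `../eval_*/decoded` layout as `eval_0_1`.

```python
from pathlib import Path


def remove_dac_files(decoded_dir, dry_run=True):
    """Return the *.dac files in decoded_dir; delete them unless dry_run is set."""
    matches = sorted(Path(decoded_dir).glob("*.dac"))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


# Dry run: only report what would be removed in each decoded folder.
for decoded in sorted(Path("..").glob("eval_*/decoded")):
    for path in remove_dac_files(decoded, dry_run=True):
        print("would remove", path)
```

Passing `dry_run=False` performs the actual deletion once the listed files look right.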

### Organize Results
All generated WAV files from the different gradient steps are merged into a single folder:

In [5]:
%%bash
mkdir -p eval_files_morph_all
find .. -type d -path "../eval_*/decoded" | while read dir; do
  find "$dir" -type f -exec cp {} eval_files_morph_all/ \;
done

In [7]:
!find eval_files_morph_all/ -type f | wc -l

     506
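
One thing the `cp`-based merge cannot show is whether two gradient steps produced files with the same basename, since a later copy silently overwrites an earlier one. The sketch below does the same merge in Python and reports such collisions instead of overwriting. The function name is hypothetical, and the folder layout is the one used in the cell above.

```python
import shutil
from pathlib import Path


def collect_decoded(root, dest):
    """Copy every file under root/eval_*/decoded into dest without overwriting.

    Returns (copied, collisions): the new paths, and the sources that were
    skipped because a file with the same name was already in dest.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    copied, collisions = [], []
    for decoded in sorted(Path(root).glob("eval_*/decoded")):
        for src in sorted(decoded.rglob("*")):
            if not src.is_file():
                continue
            target = dest / src.name
            if target.exists():
                collisions.append(src)
                continue
            shutil.copy2(src, target)
            copied.append(target)
    return copied, collisions
```

Calling `collect_decoded("..", "eval_files_morph_all")` and checking that `collisions` is empty confirms that the file count reflects every generated sample.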


### Ensure Consistency
All samples go through the same processing so that generated and reference audio are directly comparable.  
Each file is trimmed to **4 seconds** and checked for uniform audio settings:  
- Sampling rate: **44,100 Hz**  
- Channels: **Mono**  
- Bit depth: **16-bit**  

In [None]:
!pip install ffmpeg torch torchvision fadtk

In [8]:
%%bash
mkdir -p trimmed_morph_all

for f in eval_files_morph_all/*.wav; do
  fname=$(basename "$f")
  ffmpeg -y -hide_banner -loglevel error -i "$f" -t 4.0 "trimmed_morph_all/$fname"
done
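
The trimming cell passes only `-t 4.0`, so the output keeps each input's sample rate and channel count. If the inputs are not guaranteed to be 44.1 kHz mono already, the target format can be requested explicitly. The following sketch builds such a command with standard ffmpeg options (`-ar`, `-ac`, `-c:a pcm_s16le`). The function names are hypothetical, and this is a suggested variant rather than the command the notebook ran.

```python
import subprocess
from pathlib import Path


def ffmpeg_standardize_cmd(src, dst, seconds=4.0, rate=44100):
    """Build an ffmpeg command that trims, resamples, downmixes and sets 16-bit PCM."""
    return [
        "ffmpeg", "-y", "-hide_banner", "-loglevel", "error",
        "-i", str(src),
        "-t", str(seconds),    # keep at most the first `seconds` seconds
        "-ar", str(rate),      # output sample rate in Hz
        "-ac", "1",            # one output channel (mono)
        "-c:a", "pcm_s16le",   # 16-bit little-endian PCM
        str(dst),
    ]


def standardize_folder(src_dir, dst_dir):
    """Run the command above on every .wav file in src_dir."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(src_dir).glob("*.wav")):
        subprocess.run(ffmpeg_standardize_cmd(src, dst_dir / src.name), check=True)


# Show the command for one file without running ffmpeg.
print(" ".join(ffmpeg_standardize_cmd("in.wav", "out.wav")))
```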

In [9]:
!ffmpeg -i trimmed_morph_all/50992518_to_51022518_morph_0.20.wav

ffmpeg version 6.1.1 Copyright (c) 2000-2023 the FFmpeg developers
  built with clang version 14.0.6
  configuration: --prefix=/Volumes/CrucialX9/conda_envs/eval_fad --cc=x86_64-apple-darwin13.4.0-clang --ar=x86_64-apple-darwin13.4.0-ar --nm=x86_64-apple-darwin13.4.0-nm --ranlib=x86_64-apple-darwin13.4.0-ranlib --strip=x86_64-apple-darwin13.4.0-strip --disable-doc --enable-swresample --enable-swscale --enable-openssl --enable-libxml2 --enable-libtheora --enable-demuxer=dash --enable-postproc --enable-hardcoded-tables --enable-libfreetype --enable-libharfbuzz --enable-libfontconfig --enable-libdav1d --enable-zlib --enable-libaom --enable-pic --enable-shared --disable-static --disable-gpl --enable-version3 --disable-sdl2 --enable-libopenh264 --enable-libopus --enable-libmp3lame --enable-libopenjpeg --enable-libvorbis --enable-pthreads --enable-libtesseract --enable-libvpx --enable-librsvg
  libavutil      58. 29.100 / 58. 29.100
  libavcodec     60. 31.102 / 60. 31.102
  libavformat    6

In [10]:
!ffmpeg -i trimmed_morph_all/50992518_processed.wav

ffmpeg version 6.1.1 Copyright (c) 2000-2023 the FFmpeg developers
  built with clang version 14.0.6
  configuration: --prefix=/Volumes/CrucialX9/conda_envs/eval_fad --cc=x86_64-apple-darwin13.4.0-clang --ar=x86_64-apple-darwin13.4.0-ar --nm=x86_64-apple-darwin13.4.0-nm --ranlib=x86_64-apple-darwin13.4.0-ranlib --strip=x86_64-apple-darwin13.4.0-strip --disable-doc --enable-swresample --enable-swscale --enable-openssl --enable-libxml2 --enable-libtheora --enable-demuxer=dash --enable-postproc --enable-hardcoded-tables --enable-libfreetype --enable-libharfbuzz --enable-libfontconfig --enable-libdav1d --enable-zlib --enable-libaom --enable-pic --enable-shared --disable-static --disable-gpl --enable-version3 --disable-sdl2 --enable-libopenh264 --enable-libopus --enable-libmp3lame --enable-libopenjpeg --enable-libvorbis --enable-pthreads --enable-libtesseract --enable-libvpx --enable-librsvg
  libavutil      58. 29.100 / 58. 29.100
  libavcodec     60. 31.102 / 60. 31.102
  libavformat    6
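
Inspecting files one at a time with `ffmpeg -i` does not scale to several hundred samples. As a complement, the settings listed above could be checked for every file with Python's standard `wave` module, as in this sketch. The function names are hypothetical, and `wave` reads uncompressed PCM WAV files only.

```python
import wave
from pathlib import Path


def wav_format(path):
    """Return (sample_rate, channels, bits_per_sample, duration_s) of a PCM WAV."""
    with wave.open(str(path), "rb") as w:
        rate = w.getframerate()
        return rate, w.getnchannels(), 8 * w.getsampwidth(), w.getnframes() / rate


def find_mismatches(folder, rate=44100, channels=1, bits=16, max_seconds=4.0):
    """List the files whose format or length differs from the expected settings."""
    bad = []
    for path in sorted(Path(folder).glob("*.wav")):
        r, c, b, duration = wav_format(path)
        too_long = duration > max_seconds + 1e-3
        if (r, c, b) != (rate, channels, bits) or too_long:
            bad.append((path.name, r, c, b, round(duration, 3)))
    return bad
```

An empty result from `find_mismatches("trimmed_morph_all")` would indicate that all trimmed files share the expected sample rate, channel count, bit depth, and maximum length.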

### Compute FAD
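The FAD toolkit compares the trimmed folder of generated audio (`trimmed_morph_all/`) against the trimmed reference set (`trimmed/`), once per embedding model. Each run passes `--inf`, which reports the FAD-∞ estimate (the score extrapolated to an infinite sample size) instead of the plain FAD.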

In this evaluation, the reference set is the training set of the RAVE model (trimmed).
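
For reference, FAD is the Fréchet distance between two Gaussians fitted to the embeddings of the reference set (mean $\mu_r$, covariance $\Sigma_r$) and of the generated set (mean $\mu_g$, covariance $\Sigma_g$). The standard definition is given below. The symbols are introduced here for illustration, and the toolkit's estimation details (embedding layers, the FAD-∞ extrapolation) are not restated.

```latex
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Identical distributions give a distance of 0, and lower scores mean the generated audio is statistically closer to the reference set in that embedding space.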

In [None]:
!fadtk clap-laion-music trimmed/ trimmed_morph_all/ --inf

The FAD clap-laion-music score between trimmed/ and trimmed_morph_all/ is: 0.5206492398755211 

In [None]:
!fadtk clap-laion-audio trimmed/ trimmed_morph_all/ --inf

The FAD clap-laion-audio score between trimmed/ and trimmed_morph_all/ is: 0.5694514270029016 

In [None]:
!fadtk encodec-emb trimmed/ trimmed_morph_all/ --inf

The FAD encodec-emb score between trimmed/ and trimmed_morph_all/ is: 34.40105169976695  

In [None]:
!fadtk MERT-v1-95M-4 trimmed/ trimmed_morph_all/ --inf

The FAD MERT-v1-95M-4 score between trimmed/ and trimmed_morph_all/ is: 8.934735599474658 
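
To summarize the runs, the printed result lines can be parsed into a small table. The sketch below assumes the output format shown in the cells above. The regular expression and function name are illustrative and not part of fadtk.

```python
import re

# Matches lines such as:
# "The FAD <model> score between <ref> and <eval> is: <score>"
FAD_LINE = re.compile(
    r"The FAD (?P<model>\S+) score between (?P<ref>\S+) and (?P<eval>\S+) is: "
    r"(?P<score>[-+0-9.eE]+)"
)


def parse_fad_scores(text):
    """Map each embedding model name to its FAD score."""
    return {m["model"]: float(m["score"]) for m in FAD_LINE.finditer(text)}


# Output lines copied from the four runs above.
captured = """\
The FAD clap-laion-music score between trimmed/ and trimmed_morph_all/ is: 0.5206492398755211
The FAD clap-laion-audio score between trimmed/ and trimmed_morph_all/ is: 0.5694514270029016
The FAD encodec-emb score between trimmed/ and trimmed_morph_all/ is: 34.40105169976695
The FAD MERT-v1-95M-4 score between trimmed/ and trimmed_morph_all/ is: 8.934735599474658
"""

for model, score in parse_fad_scores(captured).items():
    print(f"{model:<18} {score:10.4f}")
```

Scores from different embedding models are on different scales, so each value is best compared with other systems evaluated under the same model rather than with the other rows.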