# An Overview of Perceptually Relevant Metrics of Audio Similarity for Potential Use as a Loss in Training Neural Networks

Several areas need thorough exploration:
- conventional loss functions
   - modified to improve their perceptual relevance
   - for example with pre-emphasis
- complex perception-focused metrics
   - mostly designed for quality assessment
   - may need some modification as well
- trained models
   - the least "safe" method, but likely the best performing *as long as the inputs are similar to the training dataset*

General notes from the literature:

- Amos Tversky studied similarity, perception, and categorization from a psychological point of view. He noted that human similarity judgments do not satisfy the axioms of a metric (for example, symmetry and the triangle inequality), so perceived similarity does not behave like a Euclidean distance.
- A large portion of "music similarity" research focuses on clustering music in order to optimize content delivery. [1](#references)
- An MFCC-based distance may be helpful (suggested by, though not directly stated in, [1](#references)), and it should be possible to make a mel-cepstral distance differentiable. [2](#references)
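
To make the mel-cepstral distance from the notes above concrete, here is a minimal sketch of the commonly used mel-cepstral distortion formula, in dB per frame: (10 / ln 10) * sqrt(2 * sum of squared coefficient differences). It is not the method of [2]. It assumes the mel-cepstral coefficients have already been extracted and that the two sequences are time-aligned frame by frame. The function name `mel_cepstral_distortion` and the array layout `(frames, coefficients)` are illustrative assumptions.

```python
import numpy as np


def mel_cepstral_distortion(c_ref, c_est):
    """Mean mel-cepstral distortion in dB between two (frames, coeffs) arrays.

    Assumption: both arrays are time-aligned and have the same shape.
    Column 0 (the energy term) is skipped, which is common practice.
    """
    c_ref = np.asarray(c_ref, dtype=np.float64)
    c_est = np.asarray(c_est, dtype=np.float64)
    # Coefficient-wise difference, excluding the 0th (energy) coefficient.
    diff = c_ref[:, 1:] - c_est[:, 1:]
    # Per-frame distortion: (10 / ln 10) * sqrt(2 * sum_d diff_d^2).
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    # Average over frames to get a single scalar.
    return float(np.mean(per_frame))
```

The formula uses only subtraction, squaring, summation, and a square root, so the same expression could in principle be written with a framework that supports automatic differentiation. One caveat: the gradient of the square root is unbounded at zero, so a small constant inside the root would probably be needed when using it as a training loss.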


## Conventional Loss

- Pros
   - easy to use
   - easy to represent
   - works the same regardless of data
   - computationally inexpensive
- Cons
   - little perceptual relevance on its own

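A conventional loss such as MSE might be better suited to the task when combined with some form of pre-emphasis, that is, when both the target and the predicted signal are passed through the same filter (typically a simple high-pass filter) before the error is computed.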
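
As an illustration of this idea, the sketch below applies a first-order pre-emphasis filter, `y[n] = x[n] - coeff * x[n-1]`, to both signals and then computes the MSE. The function names and the default `coeff=0.95` are assumptions chosen for the example, not values taken from the literature cited here. NumPy is used for readability; a training setup would need the same operations in a framework with automatic differentiation.

```python
import numpy as np


def pre_emphasis(x, coeff=0.95):
    """First-order pre-emphasis along the last axis: y[n] = x[n] - coeff * x[n-1]."""
    x = np.asarray(x, dtype=np.float64)
    # The first sample is kept unchanged (equivalent to assuming x[-1] = 0).
    return np.concatenate([x[..., :1], x[..., 1:] - coeff * x[..., :-1]], axis=-1)


def pre_emphasized_mse(target, prediction, coeff=0.95):
    """MSE between the pre-emphasized target and prediction waveforms."""
    diff = pre_emphasis(target, coeff) - pre_emphasis(prediction, coeff)
    return float(np.mean(diff ** 2))
```

With this filter, an error of a given amplitude is penalized more heavily at high frequencies than at low frequencies, whereas plain MSE would score both errors the same. Whether that weighting matches perception well enough for a particular task is an open question that still has to be evaluated.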



## Perception-Focused Metrics

- Pros
   - perceptually relevant
   - works the same regardless of data
- Cons
   - may not be easy to use
   - may not be easy to represent
   - can be computationally expensive



## Trained Models

- Pros
   - potential for best performance
   - should be easy to use
- Cons
   - virtually impossible to represent
   - no guarantee of performance on unknown data

## References
- [1] [A Large-Scale Evaluation of Acoustic and Subjective Music-Similarity Measures](https://www.jstor.org/stable/3681827)
- [2] [Embedding a Differentiable MEL-Cepstral Synthesis Filter to a Neural Speech Synthesis System](https://arxiv.org/pdf/2211.11222)