# An Overview of Perceptually Relevant Metrics of Audio Similarity for Potential Use as Loss In Training Neural Networks

There are several areas that need thorough exploration:
- conventional loss functions
   - equipped with some modification to improve their perceptual relevance
   - like pre-emphasis
- complex perception focused metrics
   - mostly focused on quality assessment
   - may need some modification as well
- trained models
   - least "safe" method but likely best performing *as long as the inputs are similar to the training dataset*

General notes from the literature:

- Amos Tversky studied similarity, perception, and categorization from a psychological point of view. He noted that human perception of similarity does not satisfy the definition of a Euclidean metric [[1]](#references).
- A large portion of "music similarity" research focuses on clustering music with the aim of optimizing content delivery [[1]](#references).
- An MFCC-based distance may be helpful [[1]](#references) (not mentioned directly), and there should be a way to make the mel-cepstral distance differentiable [[2]](#references).
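
To make the mel-cepstral distance mentioned above more concrete, here is a minimal NumPy sketch of the commonly used mel-cepstral distortion formula. It assumes the mel-cepstral (or MFCC) frames have already been computed and time-aligned for both signals; the function name and the `(frames, coeffs)` array layout are illustrative choices, not taken from [[1]](#references) or [[2]](#references).

```python
import numpy as np


def mel_cepstral_distortion(cep_a, cep_b):
    """Mean mel-cepstral distortion in dB between two (frames, coeffs) arrays.

    Assumption: both arrays hold time-aligned cepstral frames of equal shape.
    The 0th coefficient (overall energy) is skipped, as is common practice.
    """
    cep_a = np.asarray(cep_a, dtype=float)
    cep_b = np.asarray(cep_b, dtype=float)
    if cep_a.shape != cep_b.shape:
        raise ValueError("both inputs must have the same (frames, coeffs) shape")

    # Per-frame squared difference over coefficients 1..D.
    diff = cep_a[:, 1:] - cep_b[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))

    # Average over frames to get a single distance value.
    return float(np.mean(per_frame))
```

All operations here (subtraction, squaring, sum, square root, mean) have direct counterparts in automatic differentiation frameworks, so the same expression could likely serve as a loss term. The square root has an unbounded gradient at zero, so a small epsilon under the root would probably be needed in that setting.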

Personal comments:

- Before settling on any similarity metric, we first need to decide whether the compared signals have to be produced from the same input signal
   - A guitar player may be able to tell whether two systems are similar (or the same) even when hearing two different riffs played through them
   - A metric that does not require the same input signal on both compared systems may be helpful in some cases


## Conventional Loss

- Pros
   - easy to use
   - easy to represent
   - works the same regardless of data
   - computationally not difficult
- Cons
   - perceptually irrelevant

Conventional loss, such as MSE, might be better suited for the task at hand when improved with some form of pre-emphasis.
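
As an illustration of the idea, the following is a minimal NumPy sketch of MSE computed after a first-order pre-emphasis filter. The filter form `y[n] = x[n] - coeff * x[n-1]` and the coefficient value 0.95 are assumptions chosen for the example, not values taken from the literature reviewed here.

```python
import numpy as np


def pre_emphasis(x, coeff=0.95):
    """First-order high-pass FIR filter: y[n] = x[n] - coeff * x[n-1].

    coeff=0.95 is an illustrative assumption, not a recommended value.
    """
    x = np.asarray(x, dtype=float)
    y = np.copy(x)
    y[1:] -= coeff * x[:-1]
    return y


def pre_emphasized_mse(target, prediction, coeff=0.95):
    """MSE between two signals after both pass through the same pre-emphasis filter."""
    diff = pre_emphasis(target, coeff) - pre_emphasis(prediction, coeff)
    return float(np.mean(diff ** 2))


# Usage: two error signals with the same plain MSE but different frequency content.
fs = 44100
t = np.arange(fs // 10) / fs
silence = np.zeros_like(t)
low_error = 0.1 * np.sin(2 * np.pi * 100 * t)    # 100 Hz error
high_error = 0.1 * np.sin(2 * np.pi * 5000 * t)  # 5 kHz error
print(pre_emphasized_mse(silence, low_error))
print(pre_emphasized_mse(silence, high_error))
```

With this filter, an error at 5 kHz is penalized more heavily than an error of the same amplitude at 100 Hz, whereas plain MSE treats both equally. In a training setup the same two operations would be written with the tensor library in use so that gradients can flow through the filter.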



## Perception Focused Metrics

- Pros
   - perceptually relevant
   - works the same regardless of data (mostly [[3]](#references))
- Cons
   - may not be easy to use
   - may not be easy to represent
   - can be computationally difficult

### Notes From Literature

- PEMO-Q [[3]](#references)
   - PEMO-Q was created for lossy compression evaluation [[3]](#references).
   - Attempts to address doubts about whether PEAQ is a realistic and valid model of general auditory perception [[3]](#references).
   - Auditory model
      - The signals are preprocessed in a way that may not be suitable for our problem [[3]](#references).
      - After preprocessing, the signals are transformed into "internal representation" using an auditory signal processing model [[3]](#references).
         - 35-band gammatone filterbank (basilar membrane characteristics), with each band then processed individually
         - half-wave rectification and a low-pass filter at 1 kHz (transformation of mechanical oscillations into neural firing rates of the inner hair cells)
         - absolute hearing threshold determined from the assumed maximum signal input
         - sequence of five nonlinear feedback loops
            - each consists of a dividing element and a low-pass (RC) filter
            - the input is divided by the low-passed output
         - linear modulation filterbank, the most significant difference from previous work
            - a simplified version of PEMO-Q replaces this part with the modulation-low-pass version of the auditory model (less accurate but computationally cheaper)
      - Lastly, cognitive effects are modeled in a post-processing stage [[3]](#references).
   - The auditory model outputs are then cross-correlated channel by channel, which (after a weighted sum) gives a perceptual quality measure called PSM [[3]](#references).
   - PEMO-Q also defines a second, more detailed (in time) measure called PSM<sub>t</sub>, which is likely not of significance for our work [[3]](#references).
   - When using the computationally less demanding version, PEMO-Q is signal dependent [[3]](#references).
   - The correlation between the PSM and subjective ratings is higher than between PSM<sub>t</sub> and subjective ratings as long as only one type of signal is studied [[3]](#references).
   - According to the paper, it should be applicable more generally, but it is not suitable for predicting the impact of linear systems [[3]](#references).

- ViSQOLAudio [[4]](#references)
   - ViSQOL was originally a speech quality model that was later modified to serve as a method for perceptual evaluation of lossy compression [[4]](#references).
   - ViSQOLAudio is an improved ViSQOL (improved with respect to lossy compression evaluation) [[4]](#references).
   - ViSQOLAudio introduces machine learning and outputs a MOS estimate [[4]](#references), which may mean the original ViSQOL is better suited to our use case.
   - According to the authors of the paper, ViSQOLAudio is the first free and open source audio quality metric with accuracy comparable to proprietary metrics used in the industry [[4]](#references).
   - pg. 694 col. 1, sec. II.
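
Some of the PEMO-Q stages listed above (hair cell stage, adaptation loops, per-channel correlation) can be sketched in NumPy as follows. This is a rough illustration under several assumptions, not the PEMO-Q implementation: the gammatone and modulation filterbanks are left out, the low-pass filters are simple one-pole filters, the loop time constants and the input floor are illustrative values, and the correlation uses mean-removed (Pearson-style) normalization with uniform weights by default. All function names are hypothetical.

```python
import numpy as np


def one_pole_lowpass(x, fs, cutoff_hz):
    """Simple one-pole (RC-like) low-pass filter; the filter order is an assumption."""
    a = np.exp(-2.0 * np.pi * cutoff_hz / fs)
    y = np.empty(len(x))
    state = 0.0
    for n, v in enumerate(x):
        state = a * state + (1.0 - a) * v
        y[n] = state
    return y


def haircell_stage(band, fs):
    """Half-wave rectification followed by a 1 kHz low-pass filter."""
    return one_pole_lowpass(np.maximum(band, 0.0), fs, 1000.0)


def adaptation_loops(x, fs, time_constants=(0.005, 0.05, 0.129, 0.253, 0.5), floor=1e-5):
    """Chain of feedback loops; each divides its input by its own low-passed output.

    time_constants (seconds) and floor are illustrative assumptions.
    """
    y = np.maximum(np.asarray(x, dtype=float), floor)  # avoid division by zero
    for k, tau in enumerate(time_constants, start=1):
        a = np.exp(-1.0 / (tau * fs))
        # Start each loop at its steady state for a constant input at the floor level.
        state = floor ** (0.5 ** k)
        out = np.empty_like(y)
        for n in range(len(y)):
            out[n] = y[n] / state                    # dividing element
            state = a * state + (1.0 - a) * out[n]   # low-pass (RC) in the feedback path
        y = out
    return y


def channel_correlation_score(rep_ref, rep_test, weights=None):
    """Weighted sum of per-channel correlations between two (channels, samples) arrays."""
    rep_ref = np.asarray(rep_ref, dtype=float)
    rep_test = np.asarray(rep_test, dtype=float)
    ref = rep_ref - rep_ref.mean(axis=1, keepdims=True)
    tst = rep_test - rep_test.mean(axis=1, keepdims=True)
    num = np.sum(ref * tst, axis=1)
    den = np.sqrt(np.sum(ref ** 2, axis=1) * np.sum(tst ** 2, axis=1)) + 1e-12
    corr = num / den
    if weights is None:
        weights = np.full(len(corr), 1.0 / len(corr))  # uniform weights (assumption)
    return float(np.sum(np.asarray(weights) * corr))
```

For a constant input, each loop settles at the square root of its input, so the chain of five compresses stationary signals strongly while letting onsets through almost linearly. The sample-by-sample recursion is written for readability; for use inside a training loop it would need a vectorized or framework-native formulation.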


## Trained Models

- Pros
   - potential for best performance
   - should be easy to use
- Cons
   - virtually impossible to represent
   - no guarantee of performance on unknown data

# References
- [1] [A Large-Scale Evaluation of Acoustic and Subjective Music-Similarity Measures](https://www.jstor.org/stable/3681827)
- [2] [Embedding a Differentiable MEL-Cepstral Synthesis Filter to a Neural Speech Synthesis System](https://arxiv.org/pdf/2211.11222)
- [3] [PEMO-Q—A New Method for Objective Audio Quality Assessment Using a Model of Auditory Perception](https://ieeexplore.ieee.org/document/1709880)
- [4] [Objective Assessment of Perceptual Audio Quality Using ViSQOLAudio](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7940042)

## Sources to go through

- [a] A Differentiable Perceptual Audio Metric Learned from Just Noticeable Differences
  - https://arxiv.org/abs/2001.04460
  - neural network trained on a large dataset of crowdsourced human judgements
  - implemented in TensorFlow at: https://github.com/pranaymanocha/PerceptualAudio?tab=readme-ov-file
- [b] Auditory Feature-based Perceptual Distance
  - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10925319/
  - not reviewed yet
  - comparison of several options
- [c] Audio retrieval based on perceptual similarity
  - https://ieeexplore.ieee.org/document/7014580
- [d] Modeling Perceptual Similarity of Audio Signals for Blind Source Separation Evaluation
  - https://www.researchgate.net/publication/220848024_Modeling_Perceptual_Similarity_of_Audio_Signals_for_Blind_Source_Separation_Evaluation
- [e] A Similarity Measure for Automatic Audio Classification
  - likely not too useful
  - https://cdn.aaai.org/Symposia/Spring/1997/SS-97-03/SS97-03-001.pdf
- [f] An Objective Metric of Human Subjective Audio Quality Optimized for a Wide Range of Audio Fidelities
  - https://ieeexplore.ieee.org/abstract/document/4358089
- [g] Music Popularity: Metrics, Characteristics, and Audio-Based Prediction
  - https://ieeexplore.ieee.org/abstract/document/8327835

# Help From Others

### Audio Quality Assessment Methods From Vasek


>Hi,
>
>here are some links to audio quality assessment systems that might be applicable to your problem.
>
>https://github.com/google/visqol This is a system that, according to the paper, is decent and probably one of the most recent ones I know of. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7940042
>
>This is an older system with a simpler implementation, so it might be a more sensible starting point: https://ieeexplore.ieee.org/document/1709880
>
>In general, these systems use a filter bank called the gammatone filterbank, which can be found for example here: https://amtoolbox.org/models.php The rest of the algorithm should also be available in that toolbox, and the evaluation itself is done with equations that are easy to implement.
>
>
>I have never used these systems in practice myself. I adopted the decision stage from PEMO-Q for one conference paper. There I replaced the filter bank with something that should be more faithful to how hearing works, but it is incomparably more computationally demanding. So I would personally rather start from the established signal processing approaches in the links. If you would like to know more about how hearing works, I teach a course on it in the doctoral programme.
>
>Vasek
