# Spectral Modeling Synthesis

Papers read - 
1. A system for sound analysis based on Sinusoidal + stochastic decomposition(Prof. Xavier's thesis)<1,2,3,4>
2. Morphing guided by perception<5,6,7,8,12>
3. Creative Music<10,11>
4. Perception Studies using MDS<9,13,14>

***
#### Prof. Xavier's Thesis
***

- Motivation - To obtain a **musically useful** intermediate representation for sound transformations by modelling the spectral characteristics of sound
- Underlying Assumption 
    $$ x = x_{sine} + x_{stochastic}$$
    Where $x_{sine} = \sum_{i} A_{i}[n]\sin(\omega_{i}[n]\,n + \phi_{i}[n])$ is a sum of sinusoids described by time-varying amplitude, frequency and phase, and $x_{stochastic}$ is the stochastic (non-deterministic) component

What constitutes a **good** transformation? 
- Flexibility(ease of transformation)
- Computationally Efficient
- Should faithfully reproduce the original sound with as good quality as it can

![title](fig_1.PNG)

### Background on some Synthesis Techniques

- Historical Background - 
    1. Tape Recorders
    2. Analog tapes (Musique Concrète)
    3. Digital
- Techniques borrowed from Speech Analysis - 
    1. Vocoder 
        - Modeling of speech by an excitation waveform(sound source) which is filtered(vocal tract)
        - Were able to obtain interesting sound effects(pitch modification, timbre morphing)
    2. Linear Predictive Coding 
        - Linear time varying filtering
    3. Phase Vocoder 
        - Representing signals by the short time phase and amplitude spectrum
        - Major motivation to move towards the Short Time Fourier Transform(STFT)
- Synthesis Methods - 
    1. LPC based synthesis 
        - wide variety of transformations because of the decomposition
        - works well when analyzed sounds have **clear formant structure**
    2. Analysis based synthesis - 
        1. Heterodyne filtering 
            - breaks input waveform into pseudoperiodic segments and then estimates the pitch of each pseudoperiodic segment
            - Similar to STFT, analyzes signal at multiple, evenly-spaced time points
            - [The Application of Heterodyne Filter Analysis and Linear Predictive Coding using cSound's ADSYN, LPREAD, and LPRESON Opcodes](http://baguyos.tripod.com/DMPST.html)
        2. Phase Vocoder
            - Manipulate Temporal and Spectral Features independently(decouple them)
        3. Formant wave-function synthesis
            - Directly modelling the time domain amplitude
            - [Time Domain FoF](https://link.springer.com/chapter/10.1007/978-94-009-9091-3_21)
            - [Final Project: Formant-Wave-Function Synthesis](https://ccrma.stanford.edu/~mjolsen/220a/fp.html)
        4. VOSIM 
            - model with sinc pulses of variable amplitudes, delays
            - [paper](http://www.atiam.ircam.fr/wp-content/uploads/2011/12/AES_JAES_1978_Kaegi_VOSIM.pdf)
        5. Wavelet transform
            - Wavelets as analysis functions


### Short Time Fourier Transform

- Why perform analysis in the spectral domain?
    - Our ear is like a harmonic analyzer, thus spectral analysis mimics the behaviour of the ear
    - The cochlea is likened to a set of narrow band-pass filters, thus it performs some kind of FT
- How is our ear different?
    - Our ear obtains a **log scale spectrum** as opposed to the linear spectra obtained by conventional FT
    - Time and Frequency domain masking
    - Amplitude perception relative to frequency
- [Hearing and Perception](http://artsites.ucsc.edu/ems/music/tech_background/te-03/teces_03.html)

The STFT equation - 
    $X_{l}(k) := \sum_{n=0}^{N-1} w(n)\,x(n+lH)\,e^{-j\omega_{k}n}$, where $l$ is the frame index, $H$ is the hop size and $\omega_{k} = 2\pi k/N$  

2 important (controllable) parameters - 
1. Analysis window w(n)
    - Determines time vs frequency resolution
    - Want narrow main lobe, low side lobe
    - For phase detection, a symmetric window is used so that the window itself contributes a constant phase spectrum
2. Hop size H
    - Depends on sound characteristics

Why move on?
- Cannot manipulate sounds easily  

Treat this as an intermediate step to obtain a more flexible representation

### Sinusoidal Model

Model a signal as a sum of time varying sinusoids  
\begin{align}
&s(t) = \sum_{r=1}^{R} A_{r}(t) \cos(\theta_{r}(t))\\
&\theta_{r}(t) = \int_{0}^{t} \omega_{r}(\tau)\,d\tau + \theta_{r}(0) + \phi_{r}
\end{align}  
Here, **R** is the number of sinusoidal components, **$A_{r}(t)$** is the instantaneous amplitude and **$\theta_{r}(t)$** is the instantaneous phase

![title](fig_2.png)

The main steps in the parameter extraction are - 

1. Spectral Peak Detection - 
    - Peak detection
        - Local maxima in the magnitude spectrum at each time frame
        - Filtering the maxima with some threshold measure
        - (Optional) For perceptual purposes, use knowledge of equal loudness contours
    - Peak interpolation
        - Return a better estimate for the frequency than the bin value
        - Fit a parabola through the peak bin and its two neighbours in the magnitude spectrum, and use the vertex of the parabola as the estimate

The output of this step is the estimated magnitude, frequency and phase of the prominent peaks in the STFT for each time frame
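
Peak detection followed by parabolic interpolation can be sketched as below. The helper names `detect_peaks` and `interpolate_peaks`, the 40 dB threshold and the test tone are assumptions made for illustration; the thesis may use different thresholds and zero-padding.

```python
import numpy as np

def detect_peaks(mag_db, threshold_db):
    """Indices of local maxima of a dB magnitude spectrum above a threshold."""
    mid = mag_db[1:-1]
    is_peak = (mid > mag_db[:-2]) & (mid > mag_db[2:]) & (mid > threshold_db)
    return np.nonzero(is_peak)[0] + 1

def interpolate_peaks(mag_db, peaks):
    """Refine peak location and height with a parabola through three bins."""
    left, centre, right = mag_db[peaks - 1], mag_db[peaks], mag_db[peaks + 1]
    offset = 0.5 * (left - right) / (left - 2 * centre + right)
    return peaks + offset, centre - 0.25 * (left - right) * offset

fs, N = 8000, 512                                  # assumed sample rate and frame size
n = np.arange(N)
frame = np.sin(2 * np.pi * 1010 * n / fs) * np.hanning(N)
mag_db = 20 * np.log10(np.abs(np.fft.rfft(frame)) + 1e-12)
peaks = detect_peaks(mag_db, mag_db.max() - 40.0)  # keep peaks within 40 dB of the maximum
bins, heights = interpolate_peaks(mag_db, peaks)
freqs_hz = bins * fs / N                           # fractional bin converted to Hz
```

The 1010 Hz tone falls between two bins (the bin spacing here is 15.625 Hz), so the interpolated frequency is noticeably closer to the true value than the raw bin frequency.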

![title](fig_3.png)

2. Spectral Peak Continuation - 
    - Map the peaks at the $(n-1)^{th}$ time frame to the $(n)^{th}$ time frame
    - For each peak in the $(n-1)^{th}$ time frame, find the peak in the $(n)^{th}$ time frame that is closest to it in frequency
    - Possible approaches - heuristic (rule based), probabilistic (HMM)
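
A minimal sketch of the "closest in frequency" rule, assuming a simple greedy matcher. The function name `continue_peaks` and the `max_jump_hz` limit are hypothetical; the heuristic in the thesis has more rules (for example for starting and ending tracks).

```python
import numpy as np

def continue_peaks(prev_freqs, curr_freqs, max_jump_hz):
    """For each previous peak, return the index of the nearest unused
    current peak, or -1 if none lies within max_jump_hz (the track ends)."""
    curr_freqs = np.asarray(curr_freqs, dtype=float)
    used = np.zeros(len(curr_freqs), dtype=bool)
    matches = []
    for f in prev_freqs:
        dist = np.abs(curr_freqs - f)
        dist[used] = np.inf               # a current peak can extend only one track
        j = int(np.argmin(dist)) if len(dist) else -1
        if j >= 0 and dist[j] <= max_jump_hz:
            used[j] = True
            matches.append(j)
        else:
            matches.append(-1)
    return matches

matches = continue_peaks([440.0, 880.0, 1320.0], [445.0, 1300.0, 2000.0], 50.0)
```

Here the 880 Hz track finds no continuation and ends, while the unmatched 2000 Hz peak would start a new track.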

![title](fig_4.png)

Once the parameters are obtained, the sound is synthesized by generating the sum of sinusoids for each time frame in the following way -  
\begin{align}
s^{l}(m) = \Sigma_{r=1}^{R} \hat A^{l}_{r}cos(m\hat \omega^{l}_{r} + \hat \phi^{l}_{r})
\end{align}
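
The frame synthesis equation translates directly into NumPy. In this sketch the name `synth_frame`, the sample rate and the two example partials are assumptions, and the parameters are held constant within the frame exactly as the equation is written.

```python
import numpy as np

def synth_frame(amps, omegas, phases, length):
    """Sum of R sinusoids over one frame; omegas are in radians per sample."""
    m = np.arange(length)
    # Rows index the sinusoid r, columns index the sample m
    partials = amps[:, None] * np.cos(m[None, :] * omegas[:, None] + phases[:, None])
    return partials.sum(axis=0)

fs = 8000                                           # assumed sample rate in Hz
amps = np.array([1.0, 0.5])
omegas = 2 * np.pi * np.array([440.0, 880.0]) / fs  # Hz converted to rad/sample
phases = np.array([0.0, np.pi / 2])
frame = synth_frame(amps, omegas, phases, 256)
```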

Sound Effects - Can be easily achieved by manipulating the obtained parameters (scaling, interpolation, filtering, etc.). For ease of manipulation, only the magnitude spectrum is used, as the ear is mainly sensitive to the spectral magnitude and not the phase

Why move on?
- Difficult to model **noise** with sinusoids (a large number of them is needed)
- Because noise is not modelled, the perceived quality is somewhat artificial after transformations  
    This is the motivation to model the noise in the signal as the next step

### Deterministic + Residual Model

How is the **deterministic** component different from the previous?
- Instead of tracking any peak in the spectrum (as in the previous case), the deterministic model tracks only the partials of the sound
- Thus, each sinusoid is assumed to model a **quasi-sinusoidal** component (piecewise linear amplitude and frequency variation) rather than any kind of sound   

The **Residual** in this case is defined as $x_{original} - x_{deterministic}$. It usually models the excitation energy that is not turned into stable vibrations, or any component that is not inherently sinusoidal in nature.

The signal in this case is modelled as - 
\begin{align}
s(t) = \sum_{r=1}^{R} A_{r}(t) \cos(\theta_{r}(t)) + e(t)
\end{align}
Here, e(t) is the residual

![title](fig_5.png)

Most of the steps in extracting the parameters for the deterministic model are the same as the previous model. But, since the sinusoids are restricted to be partials only here, there is a modification in the **Spectral Peak Continuation** process. Using a heuristic(rule based) and some prior knowledge about the nature of the sound(harmonic, frequency range etc. ), an algorithm is proposed which tracks only the clear and stable partials

The deterministic components are synthesized using the parameters obtained. The residual is obtained by subtracting this deterministic signal from the original signal.  
An easier alternative is to subtract the magnitude spectra of the two signals and ignore the phase (perceptually unimportant)
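
The magnitude-domain subtraction can be sketched per frame as follows. The function name `residual_magnitude` is hypothetical, and clipping negative differences to zero is an assumption of this sketch rather than something stated in the notes.

```python
import numpy as np

def residual_magnitude(x_frame, det_frame, window):
    """Residual magnitude spectrum of one frame, ignoring phase."""
    X = np.abs(np.fft.rfft(x_frame * window))
    D = np.abs(np.fft.rfft(det_frame * window))
    return np.maximum(X - D, 0.0)   # assumption: negative differences are clipped

rng = np.random.default_rng(0)
N = 512
w = np.hanning(N)
det = np.cos(2 * np.pi * 0.05 * np.arange(N))   # stand-in deterministic frame
x = det + 0.01 * rng.standard_normal(N)         # original = deterministic + noise
res = residual_magnitude(x, det, w)
```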

Why move on?
- Residual is not flexible for performing transformations  

This motivates further study of the residual, approximating it with a model that can be easily manipulated.

### Deterministic + Stochastic Model

Observations from the previous models - 
- Not necessary to preserve phase
- Can model the residual as some kind of stochastic signal  

Modelling the residual as a stochastic signal helps in easily transforming the signal  

![title](fig_6.png)

The representation obtained is similar to the previous case, except that the residual e(t) is modelled as a stochastic signal, so it can be written as the output of a linear time-varying filter driven by white noise.  
\begin{align}
\hat e(t) = \int_{0}^{t}h(t,t-\tau)u(\tau)d \tau
\end{align}
Here, $u(t)$ is white noise and $h(t,t')$ is the impulse response of the time-varying filter.

![title](fig_7.png)

The deterministic component is calculated in the same way as the previous. The parameters are set in such a way as to extract the partials as accurately as possible(to prevent them from appearing in the residual)

Since we assume the residual to be a stochastic signal, it is characterized by its amplitude and the general shape of its spectrum rather than by its exact waveform.   
To obtain this shape, we approximate the envelope of the residual magnitude spectrum, which is obtained by subtracting the deterministic magnitude spectrum from the original one. Only the shape of the envelope contributes to the sound characteristics. The envelope is approximated by **curve fitting** or **LPC**.   
Once the envelope is obtained, we generate the stochastic signal by using it as the magnitude spectrum and drawing random numbers for the phase
\begin{align}
\hat e(t) = IFT(A(k)e^{j \Theta(k)})
\end{align}
Here, A(k) is the envelope, and $\Theta(k)$ is the phase(random)
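
A minimal sketch of this inverse-transform step, assuming the envelope is given on the non-negative frequency bins of a real signal. The name `synth_stochastic` and the example envelope are illustrative assumptions.

```python
import numpy as np

def synth_stochastic(envelope, rng):
    """envelope: magnitudes A(k) for k = 0..N/2; returns N real samples."""
    phase = rng.uniform(-np.pi, np.pi, size=len(envelope))
    phase[0] = 0.0    # the DC bin of a real signal must be real
    phase[-1] = 0.0   # so must the Nyquist bin
    spectrum = envelope * np.exp(1j * phase)
    return np.fft.irfft(spectrum)

rng = np.random.default_rng(0)
envelope = np.linspace(1.0, 0.1, 257)   # assumed smooth, decaying envelope
e_hat = synth_stochastic(envelope, rng)
```

Each call gives a different waveform, but the magnitude spectrum of the frame always equals the envelope, which is the property the model relies on.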

Transformations - Can be separately applied to the deterministic and stochastic components. 
- Deterministic - Similar transformations like before
- Stochastic - Envelope shaping, filtering etc.   

![fig](fig_8.png)

### Examples of sound effects using the above model (Refer 4.pdf)
1. Filtering
2. Pitch Scaling, transposition and discretization
3. Vibrato, tremolo
4. Spectral shape shifting
5. Gender changing
6. Harmonizing
7. Hoarseness
8. Morphing

***
#### Musical Instrument Sound Morphing Guided by Perceptually Motivated Features
***
For sound examples, visit [this page](http://recherche.ircam.fr/anasyn/caetano/overview.html)

What is **Morphing**?  
- Blurring Distinction between **Source** and **Target**
- Somewhat like creating **hybrid** musical instruments  
- Would like to ideally perform **Perceptually Linear** transformations
- The morphed sound should not simply sound like a mixture of sounds(the ear can distinguish in such cases). It should rather sound like a single **entity**  

How is it done?

- Obtain some kind of representation of each sound, and then use an interpolation function that gradually moves these representations from one sound to the other.  
- Control the whole morphing process (algorithmically and perceptually) with a single coefficient $\alpha$, the interpolation factor
- Ideally, the parameters are interpolated so that the morphed sound varies **perceptually linearly** with $\alpha$

In this work, the authors have proposed to seek sound parameters that favor Perceptually Linear transformations

Work done previously  
- Mostly interpolate parameters/features without caring much about the perceptual impact
![Classical Morphing](fig_9.PNG)  

What are parameters,features?
- Parameters - Coefficients obtained from sound analysis models(can resynthesize sound from them)
- Features - Particular aspects of sound  

Methods Used - 
1. Parameter interpolation using Wigner Distributions(Time Frequency)  
2. GMM models for parameter interpolation  
3. **Model Sounds as dynamical systems with ANN**  
4. Discrete Wavelet Transform(DWT) + Singular Value Decomposition(SVD)  

The above don't consider perceptual factors, which suggests using interpolation strategies with better perceptual correlation, like the ones below
1. Dynamic Frequency Warping (DFW) to morph spectral envelopes
2. Multi Dimensional Scaling(MDS)

One important thing to consider in all the above cases is the need for the sounds to be **temporally aligned**; otherwise some kind of smearing might occur, making the resultant sound artificial

In this work, as opposed to interpolating parameters directly, the authors propose to first obtain relevant features from the parameters (which might have a more perceptual meaning than the parameters themselves), and then interpolate the features themselves.  
![Authors Proposed Idea](fig_10.PNG)
However, recovering parameters from features is difficult (the mapping is not one-to-one). Instead, the authors propose choosing parameters whose interpolation yields sounds with features close to the interpolated feature values. The suggested evaluation criterion is that feature values should vary linearly when the parameters are interpolated linearly. 

The features the authors use in this work are obtained by finding **acoustic correlates of Timbre Spaces using MDS** (essentially trying to mathematically describe the Timbre Space). The features are both temporal and spectral.  

Temporal  
1. log attack time
2. temporal centroid  

Spectral
1. spectral centroid
2. spectral spread
3. spectral skewness
4. spectral kurtosis
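
The four spectral features are commonly computed as moments of the magnitude spectrum treated as a distribution over frequency. The sketch below uses that common definition as an assumption; the paper may normalise differently, and the function name `spectral_moments` is hypothetical.

```python
import numpy as np

def spectral_moments(freqs_hz, mag):
    """Centroid, spread, skewness and kurtosis of a magnitude spectrum."""
    p = mag / np.sum(mag)                      # normalise to a distribution
    centroid = np.sum(freqs_hz * p)            # first moment (mean frequency)
    dev = freqs_hz - centroid
    spread = np.sqrt(np.sum(dev**2 * p))       # standard deviation around the centroid
    skewness = np.sum(dev**3 * p) / spread**3  # asymmetry
    kurtosis = np.sum(dev**4 * p) / spread**4  # peakedness
    return centroid, spread, skewness, kurtosis

# Toy spectrum: two equal partials at 100 Hz and 300 Hz
centroid, spread, skewness, kurtosis = spectral_moments(
    np.array([100.0, 300.0]), np.array([1.0, 1.0])
)
```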

The authors' proposed model - 
![Authors Proposed Model](fig_11.PNG)

Extraction of Parameters - 
1. Temporal Segmentation - Segment into ADSR
2. Temporal Alignment - Boundaries should coincide
3. Temporal Envelope Extraction - True Amplitude Envelope(TAE) based on cepstral smoothing
4. Sinusoidal + Residual Model - To obtain the parameters
5. Source Filter Model - 

Morphing - 
1. Spectral Envelope Morphing - Shift in frequency peaks smoothly
2. Interpolation of partial frequencies
3. Temporal Envelope Morphing


Evaluation
- Vienna Symphonic Library
- Listening Test 
    - Judge several Characteristics for each morph value
    - Complicated and very subjective
- Proposed Objective error function(assuming linearity, essentially the MSE)
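
One possible reading of the objective error is sketched below: measure a feature on each morphed sound and take the mean squared deviation from the straight line joining the source and target values. This is an interpretation of the note above, not the paper's exact formula, and the name `linearity_error` is hypothetical.

```python
import numpy as np

def linearity_error(alphas, feature_values):
    """MSE between measured feature values and a perfectly linear morph.
    Assumes alphas[0] == 0 (source) and alphas[-1] == 1 (target)."""
    alphas = np.asarray(alphas, dtype=float)
    f = np.asarray(feature_values, dtype=float)
    target = (1 - alphas) * f[0] + alphas * f[-1]   # ideal linear variation
    return float(np.mean((f - target) ** 2))

err_linear = linearity_error([0.0, 0.5, 1.0], [100.0, 150.0, 200.0])
err_bent = linearity_error([0.0, 0.5, 1.0], [0.0, 1.0, 1.0])
```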

***
#### Spectral Envelope Estimation and Representation for Sound Analysis–Synthesis
***

What is it - Envelope of magnitude of Short Time Spectrum(STS)
1. Speaks about **Spectral Envelopes** in sound analysis and synthesis.
2. Linked to perception, and how they capture **important properties** of sound.
3. Challenges - Not easy to estimate and represent them

What you want - 
1. Envelope fit - Links the peaks of partials
2. Smoothness - should not oscillate wildly, just give a rough idea of the shape
3. Adaptation to fast spectral variations

Methods of Estimation - 
1. Linear predictive coding
2. Cepstrum
3. Discrete cepstrum
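
As an illustration of the cepstrum method, the sketch below smooths a magnitude spectrum by keeping only the low-quefrency cepstral coefficients. The function name `cepstral_envelope`, the number of coefficients kept and the rippled test spectrum are assumptions; the discrete cepstrum (which fits the peaks of partials) is a different procedure and is not shown.

```python
import numpy as np

def cepstral_envelope(mag, n_coeffs):
    """Smooth a magnitude spectrum (bins 0..N/2) by low-pass liftering."""
    log_mag = np.log(np.maximum(mag, 1e-12))
    cepstrum = np.fft.irfft(log_mag)       # real cepstrum, length N
    lifter = np.zeros_like(cepstrum)
    lifter[:n_coeffs] = 1.0                # keep low quefrencies
    if n_coeffs > 1:
        lifter[-(n_coeffs - 1):] = 1.0     # and their mirrored counterparts
    return np.exp(np.fft.rfft(cepstrum * lifter).real)

N = 512
k = np.arange(N // 2 + 1)
# Flat log spectrum plus a fast ripple standing in for harmonic fine structure
mag = np.exp(1.0 + 0.5 * np.cos(2 * np.pi * 50 * k / N))
env = cepstral_envelope(mag, 20)
```

Keeping fewer coefficients gives a smoother envelope, while keeping more lets the envelope follow the spectrum more closely.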

What is wanted - 
1. Precision
2. Stability (like BIBO, small changes in data should give small changes in output)
3. Locality in frequency - parameters should cause local, not global changes
4. Flexibility and ease of manipulations - Should be easily manipulable with tunable parameters
5. Speed of synthesis 
6. Space in memory
7. Manual input - manual tuning/control
    
Proposed Representations - 
1. Filter coefficients
2. Sampled representation
3. Geometric representation
4. Formants

What can be done with them - 
1. Influence timbre
2. Enhance musical expressivity

Link to webpage describing the above in detail - [link](http://recherche.ircam.fr/anasyn/schwarz/publications/icmc1999/se99-poster.html)

***
#### AUTOMATIC TIMBRAL MORPHING OF MUSICAL INSTRUMENT SOUNDS BY HIGH-LEVEL DESCRIPTORS
***

Precursor to the second paper above.  
Speaks about the importance of taking perception into account while morphing (go through the paper highlights; paper 2 mostly extends this work)

***
#### AUTOMATIC AUDIO MORPHING
***

Automatic morphing from one sound to another
- Sound represented in a multi-dimensional space
- Axes represent features that **perceptually represent** the sound, in this case spectral shape and pitch
- The axes are assumed to be orthogonal i.e. each axis can be transformed independent of the others
- Morphing essentially represents interpolation in this space

One sound should **smoothly** change into another sound. The process is described below in the figure - 
![fig](fig_12.PNG)
1. A sound representation in the 'space'(in this case pitch and envelope) is obtained.
2. Matching (alignment) to align relevant features
3. Interpolation, followed by reconstruction

Spectrograms are used to encode the two axes (a smooth spectrogram using MFCCs, followed by a pitch spectrogram)

The morphing - Linear interpolation between the matched features
![morphing](fig_13.PNG)
The signal is linearly interpolated in time from features of signal 1 to signal 2.
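
Linear interpolation in time between two matched feature vectors can be sketched as below. The function name `morph_features` and the two-element feature vector (pitch in Hz, one envelope value) are illustrative assumptions, not the paper's actual representation.

```python
import numpy as np

def morph_features(feat_a, feat_b, n_frames):
    """Cross-fade aligned feature vectors: the first frame equals feat_a
    and the last frame equals feat_b."""
    alpha = np.linspace(0.0, 1.0, n_frames)[:, None]   # one weight per frame
    return (1 - alpha) * feat_a[None, :] + alpha * feat_b[None, :]

feat_a = np.array([220.0, 0.0])   # assumed features of signal 1
feat_b = np.array([440.0, 1.0])   # assumed features of signal 2
morph = morph_features(feat_a, feat_b, 5)
```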

Future work - 
1. Better representations
2. Better matching techniques
3. **Perceptually optimal interpolation functions**

***
#### Creative Music(Work by Tristan Jehan at MIT Media Lab)
***

What has he done - In his Master's thesis<11>, he built what he calls a **hyperviolin**: an instrument whose sound/timbre can be modified in real time by modifying parameters. Then, in his PhD thesis<10>, he created a system that can essentially **create** music from data.

***
MSc Thesis - Audio Driven Timbre Generator  
Salient points - 
- **Little to no digital synthesis technology for non keyboard instruments**{Potential Work area}
- Limitations in current systems are - 
    1. Quality
    2. Control over synthesis
    3. Instrument specific
- Hyperviolin - 
    1. Takes musical performance data(audio + gesture)
    2. Processing(can be in realtime)
    3. Generation/Synthesis
- They model **Physical Sound**, not the perceptual features(What they call **Timbre Model**)
- **Almost no work done on perceptually-controlled sound synthesis**{Potential Work area}
- Future work - 
    1. Algorithms that extract **better and new** perceptual features
    2. Study the **evolution of parameters** rather than instantaneous parameters, i.e. take into account how parameters evolve in the piece and transform accordingly

***
PhD Thesis - Creating Music by Listening
- There is a high degree of abstraction between sampled audio and the mental perception of it. The author proposes to bridge this gap by modeling human perception and learning of music.
- Composing new music by **recycling** pre-existing music.
- Thinks of all possible audio signals as a space. Music is a very small subset with some structure in this space. The author wants to make an **intelligent search** in this space to **re-discover** music.

Interesting points - 
- Perceptual Synthesis Engine (MSc thesis work, SMS + Perception) -> Decompose audio into parameters (frequency, amplitude) and their perceptual correlates (instantaneous pitch, loudness, brightness), then learn the relation between the two sets of data
- Use other metadata (besides audio) such as acoustic, cultural and editorial metadata for retrieval {Can we do this for synthesis? Study what kinds of metadata are available}

**Music Cognition Machine**
![Proposed Framework](fig_14.PNG)