# Spectral Modeling Synthesis

- Motivation - To obtain a **musically useful** intermediate representation for sound transformations by modelling the spectral characteristics of sound
- Underlying Assumption 
    $$ x = x_{sine} + x_{stochastic}$$
    where $x_{sine}[n] = \sum_{i} A_{i}[n]\sin(\omega_{i}[n]\,n + \phi_{i}[n])$ is a sum of sinusoids with time-varying amplitude, frequency and phase, and $x_{stochastic}$ is the stochastic (non-deterministic) component

What constitutes a **good** transformation? 
- Flexibility (ease of transformation)
- Computational efficiency
- Faithful reproduction of the original sound, with the best quality achievable

![title](fig_1.png)

### Background on some Synthesis Techniques

- Historical Background - 
    1. Tape Recorders
    2. Analog tapes (musique concrète)
    3. Digital
- Techniques borrowed from Speech Analysis - 
    1. Vocoder 
        - Models speech as an excitation waveform (the sound source) filtered by the vocal tract
        - Made interesting sound effects possible (pitch modification, timbre morphing)
    2. Linear Predictive Coding 
        - Linear time varying filtering
    3. Phase Vocoder 
        - Representing signals by the short time phase and amplitude spectrum
        - Major motivation to move towards the Short Time Fourier Transform(STFT)
- Synthesis Methods - 
    1. LPC based synthesis 
        - wide variety of transformations because of the decomposition
        - works well when analyzed sounds have **clear formant structure**
    2. Analysis based synthesis - 
        1. Heterodyne filtering 
            - breaks input waveform into pseudoperiodic segments and then estimates the pitch of each pseudoperiodic segment
            - Similar to STFT, analyzes signal at multiple, evenly-spaced time points
            - [The Application of Heterodyne Filter Analysis and Linear Predictive Coding using cSound's ADSYN, LPREAD, and LPRESON Opcodes](http://baguyos.tripod.com/DMPST.html)
        2. Phase Vocoder
            - Manipulate Temporal and Spectral Features independently(decouple them)
        3. Formant wave-function synthesis
            - Directly modelling the time domain amplitude
            - [Time Domain FoF](https://link.springer.com/chapter/10.1007/978-94-009-9091-3_21)
            - [Final Project: Formant-Wave-Function Synthesis](https://ccrma.stanford.edu/~mjolsen/220a/fp.html)
        4. VOSIM 
            - model with sinc pulses of variable amplitudes, delays
            - [paper](http://www.atiam.ircam.fr/wp-content/uploads/2011/12/AES_JAES_1978_Kaegi_VOSIM.pdf)
        5. Wavelet transform
            - Wavelets as analysis functions


### Short Time Fourier Transform

- Why perform analysis in the spectral domain?
    - Our ear is like a harmonic analyzer, thus spectral analysis mimics the behaviour of the ear
    - Cochlea is likened to a set of narrow band pass filters, thus it performs some kind of FT
- How is our ear different?
    - Our ear obtains a **log scale spectrum** as opposed to the linear spectra obtained by conventional FT
    - Time and Frequency domain masking
    - Amplitude perception relative to frequency
- [Hearing and Perception](http://artsites.ucsc.edu/ems/music/tech_background/te-03/teces_03.html)

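The STFT equation - 
    $X_{l}(k) := \sum_{n=0}^{N-1} w(n)x(n+lH)e^{-j\omega_{k}n}$  
where $l$ is the frame index, $N$ is the frame (DFT) length, $H$ is the hop size and $\omega_{k} = 2\pi k/N$ is the frequency of bin $k$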

2 important (controllable) parameters - 
    1. Analysis window w(n)
        - Determines time vs frequency resolution
        - Want a narrow main lobe and low side lobes
        - For phase detection, use a symmetric window: centred at the time origin it has a real spectrum, so the phase measured at a peak is not distorted by the window
    2. Hop size H
        - Depends on sound characteristics
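
The equation and its two parameters can be sketched in a few lines of NumPy. This is only a minimal illustration: the function name `stft`, the Hann window, and the values `N = 256`, `H = 64` are assumptions chosen for the example, not choices prescribed by these notes.

```python
import numpy as np

def stft(x, window, hop):
    """Return X[l, k]: the DFT of frame l, following X_l(k) = sum_n w(n) x(n + l*H) e^{-j w_k n}."""
    N = len(window)
    n_frames = 1 + (len(x) - N) // hop
    # Slice the signal into overlapping frames and apply the analysis window.
    frames = np.stack([x[l * hop : l * hop + N] * window for l in range(n_frames)])
    # One DFT per frame; bin k corresponds to w_k = 2*pi*k/N.
    return np.fft.fft(frames, axis=1)

# Assumed test signal: one second of a 1000 Hz cosine sampled at 8 kHz.
fs = 8000
N, H = 256, 64
n = np.arange(fs)
x = np.cos(2 * np.pi * 1000 * n / fs)

X = stft(x, np.hanning(N), H)
peak_bin = int(np.argmax(np.abs(X[0, : N // 2])))  # strongest bin in the first frame
peak_hz = peak_bin * fs / N
```

A longer window narrows the main lobe (better frequency resolution) at the cost of time resolution, and a smaller hop gives more frames per second.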

Why move on?
- Cannot manipulate sounds easily

Treat this as an intermediate step to obtain a more flexible representation

### Sinusoidal Model

Model a signal as a sum of time varying sinusoids  
\begin{align}
&s(t) = \sum_{r=1}^{R} A_{r}(t) \cos(\theta_{r}(t))\\
&\theta_{r}(t) = \int_{0}^{t} \omega_{r}(\tau)d \tau + \theta_{r}(0) + \phi_{r}
\end{align}  
Here, **R** is the number of sinusoidal components, **$A_{r}(t)$** is the instantaneous amplitude, **$\theta_{r}(t)$** is the instantaneous phase and **$\omega_{r}(t)$** is the instantaneous frequency of the $r^{th}$ component
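
A discrete-time sketch of this model, where the integral of the instantaneous frequency is approximated by a running sum. The function name `additive_synth`, the argument `phase0` (standing in for the constant phase terms) and the example trajectories are all hypothetical, chosen only to illustrate the formula.

```python
import numpy as np

def additive_synth(amps, freqs_hz, fs, phase0=None):
    """Sum of R time-varying sinusoids.

    amps, freqs_hz: arrays of shape (R, T) with the instantaneous amplitude and
    frequency (in Hz) of each component. phase0 (length R) is a constant phase offset.
    """
    amps = np.atleast_2d(np.asarray(amps, dtype=float))
    freqs_hz = np.atleast_2d(np.asarray(freqs_hz, dtype=float))
    if phase0 is None:
        phase0 = np.zeros(amps.shape[0])
    # Discrete approximation of the integral of the frequency: running sum / fs.
    theta = 2 * np.pi * np.cumsum(freqs_hz, axis=1) / fs
    theta = theta + np.asarray(phase0, dtype=float)[:, None]
    return np.sum(amps * np.cos(theta), axis=0)

fs = 8000
T = fs  # one second
# Assumed example: a fading glide from 200 Hz to 400 Hz plus a steady 600 Hz partial.
freqs = np.stack([np.linspace(200.0, 400.0, T), np.full(T, 600.0)])
amps = np.stack([np.linspace(1.0, 0.0, T), np.full(T, 0.3)])
s = additive_synth(amps, freqs, fs)
```

Integrating the frequency, rather than multiplying it by time, is what keeps the phase continuous when the frequency changes.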

![title](fig_2.png)

The main steps in the parameter extraction are - 

1. Spectral Peak Detection - 
    - Peak detection
        - Local maxima in the magnitude spectrum at each time frame
        - Filtering the maxima with some threshold measure
        - (Optional) For perceptual purposes, use knowledge of equal loudness contours
    - Peak interpolation
        - Return a better estimate for the frequency than the bin value
        - Fit a parabola through the peak bin and its two neighbours in the magnitude spectrum (in dB), and use the vertex of the parabola as the estimate

The output of this step is the estimated magnitude, frequency and phase of the prominent peaks in the STFT for each time frame
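
A minimal sketch of peak detection followed by parabolic interpolation, for one frame. The helper names `detect_peaks` and `interpolate_peaks`, the 20 dB threshold and the 1002 Hz test tone are assumptions for illustration; phase estimation and the optional perceptual weighting are left out.

```python
import numpy as np

def detect_peaks(mag_db, threshold_db):
    """Indices of local maxima of mag_db that lie above threshold_db."""
    mid = mag_db[1:-1]
    is_peak = (mid > mag_db[:-2]) & (mid > mag_db[2:]) & (mid > threshold_db)
    return np.nonzero(is_peak)[0] + 1

def interpolate_peaks(mag_db, peaks):
    """Refine each peak with a parabola through the peak bin and its two neighbours."""
    a, b, c = mag_db[peaks - 1], mag_db[peaks], mag_db[peaks + 1]
    offset = 0.5 * (a - c) / (a - 2 * b + c)    # vertex position, in bins, relative to the peak bin
    height = b - 0.25 * (a - c) * offset        # parabola value at the vertex, in dB
    return peaks + offset, height

# Assumed test frame: a 1002 Hz cosine, which falls between two bins.
fs, N = 8000, 1024
f0 = 1002.0
n = np.arange(N)
x = np.cos(2 * np.pi * f0 * n / fs)
mag_db = 20 * np.log10(np.abs(np.fft.rfft(x * np.hanning(N))) + 1e-12)

peaks = detect_peaks(mag_db, mag_db.max() - 20.0)
locs, heights = interpolate_peaks(mag_db, peaks)
freqs_hz = locs * fs / N          # interpolated frequency estimates
raw_hz = peaks * fs / N           # estimates from the bin index alone
```

With a bin spacing of about 7.8 Hz here, the bin index alone can be off by several Hz, while the interpolated estimate lands much closer to the true frequency.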

![title](fig_3.png)

2. Spectral Peak Continuation - 
    - Map the peaks at the $(n-1)^{th}$ time frame to the peaks at the $n^{th}$ time frame
    - For each peak in the $(n-1)^{th}$ frame, find the peak in the $n^{th}$ frame that is closest to it in frequency
    - Possible approaches - heuristic (rule based), probabilistic (HMM)
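
A very simple rule-based version of the continuation step is a greedy nearest-frequency match. This sketch is an assumption-laden illustration, not the algorithm used in the figures: the function name `continue_peaks` and the `max_dev_hz` limit are hypothetical, and a real tracker would also handle births, deaths and short gaps in a trajectory.

```python
import numpy as np

def continue_peaks(prev_freqs, curr_freqs, max_dev_hz):
    """For each peak of the previous frame, return the index of the matched
    peak in the current frame, or -1 if no peak is close enough."""
    curr_freqs = np.asarray(curr_freqs, dtype=float)
    taken = np.zeros(len(curr_freqs), dtype=bool)
    matches = []
    for f in prev_freqs:
        dist = np.abs(curr_freqs - f)
        dist[taken] = np.inf                  # each current peak continues at most one track
        j = int(np.argmin(dist)) if len(dist) else -1
        if j >= 0 and dist[j] <= max_dev_hz:
            taken[j] = True
            matches.append(j)
        else:
            matches.append(-1)                # the track ends at the previous frame
    return matches

prev_frame = [440.0, 880.0, 1320.0]
curr_frame = [445.0, 1310.0, 2000.0]
matches = continue_peaks(prev_frame, curr_frame, max_dev_hz=30.0)
# matches == [0, -1, 1]: the 880 Hz track has no continuation,
# and the unmatched 2000 Hz peak would start a new track.
```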

![title](fig_4.png)

Once the parameters are obtained, the sound is synthesized by generating the sum of sinusoids for each time frame in the following way -  
\begin{align}
s^{l}(m) = \sum_{r=1}^{R} \hat A^{l}_{r}\cos(m\hat \omega^{l}_{r} + \hat \phi^{l}_{r})
\end{align}
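
The synthesis equation for a single frame translates directly to NumPy. The function name `synth_frame`, the frame length of 64 samples and the two example partials are assumptions. The sketch holds the parameters constant within the frame, exactly as the equation is written; practical systems typically interpolate the parameters between frames to avoid discontinuities at the frame boundaries.

```python
import numpy as np

def synth_frame(amps, freqs, phases, length):
    """s(m) = sum_r A_r cos(m * w_r + phi_r) for m = 0..length-1 (w_r in radians/sample)."""
    m = np.arange(length)[:, None]            # column of sample indices, shape (length, 1)
    return np.sum(amps * np.cos(m * freqs + phases), axis=1)

fs = 8000
amps = np.array([1.0, 0.5])
freqs = 2 * np.pi * np.array([440.0, 880.0]) / fs   # convert Hz to radians/sample
phases = np.array([0.0, np.pi / 2])
frame = synth_frame(amps, freqs, phases, 64)
```

Transformations such as pitch scaling amount to modifying `freqs` (or `amps`) before calling the synthesis.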

Sound Effects - Can be easily achieved by playing around with the obtained parameters (scaling, interpolation, filtering etc.). For ease of manipulation, only the magnitude spectrum is used, as the ear is mainly sensitive to the spectral magnitude and not the phase

Why move on?
- Difficult to model **noise** with sinusoids (need a large number)
- Because of the lack of noise modelling, the perceived quality is a bit artificial during transformations  
    This is motivation to model the noise in the signal as the next step

### Deterministic + Residual Model

How is the **deterministic** component different from the previous?
- As opposed to selecting any peak in the spectrum (like in the previous case), the deterministic model tracks only the partials of the sound
- Thus, each sinusoid is assumed to model a **quasi-sinusoidal** component (piecewise linear amplitude and frequency variation) as opposed to any kind of sound

The **Residual** in this case is defined as $x_{original} - x_{deterministic}$. It usually models the energy that does not go into vibrations, or any component that is not inherently sinusoidal in nature.

The signal in this case is modelled as - 
\begin{align}
s(t) = \sum_{r=1}^{R} A_{r}(t) \cos(\theta_{r}(t)) + e(t)
\end{align}
Here, $e(t)$ is the residual

![title](fig_5.png)

Most of the steps in extracting the parameters for the deterministic model are the same as the previous model. But, since the sinusoids are restricted to be partials only here, there is a modification in the **Spectral Peak Continuation** process. Using a heuristic(rule based) and some prior knowledge about the nature of the sound(harmonic, frequency range etc. ), an algorithm is proposed which tracks only the clear and stable partials

The deterministic components are synthesized using the parameters obtained. The residual is obtained by subtracting this deterministic signal from the original signal.  
An easier alternative is to subtract the magnitude spectra of the two signals, ignoring the phase (perceptually unimportant)
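
The magnitude-only alternative can be sketched per frame as follows. The function name `residual_magnitude`, the Hann window and the synthetic test frame are assumptions; clamping negative differences to zero is also an assumption of this sketch, made so that the result remains a valid magnitude.

```python
import numpy as np

def residual_magnitude(x_frame, det_frame, window):
    """Approximate residual magnitude spectrum of one frame: |X(k)| - |D(k)|, clamped at 0."""
    X = np.abs(np.fft.rfft(x_frame * window))
    D = np.abs(np.fft.rfft(det_frame * window))
    return np.maximum(X - D, 0.0)

# Assumed test frame: one partial (exactly on bin 32) plus a little white noise.
N = 256
n = np.arange(N)
rng = np.random.default_rng(0)
deterministic = np.cos(2 * np.pi * 32 * n / N)
original = deterministic + 0.01 * rng.standard_normal(N)

res_mag = residual_magnitude(original, deterministic, np.hanning(N))
```

At the bin of the partial the two magnitudes almost cancel, so what remains in `res_mag` is roughly the spectrum of the noise.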

Why move on?
- Residual is not flexible for performing transformations

This motivates further study of the residual, to approximate it with a model that can be easily played around with.

### Deterministic + Stochastic Model

Observations from the previous models - 
- Not necessary to preserve phase
- Can model the residual as some kind of stochastic signal

Modelling the residual as a stochastic signal helps in easily transforming the signal

![title](fig_6.png)

The representation obtained is similar to the previous case, except that the residual e(t) is modelled as a stochastic signal, which allows it to be written as the output of a linear time-varying system driven by white noise.  
\begin{align}
\hat e(t) = \int_{0}^{t}h(t,t-\tau)u(\tau)d \tau
\end{align}
Here, $u(t)$ is white noise and $h(t,t')$ is the impulse response of the time-varying filter.

![title](fig_7.png)

The deterministic component is calculated in the same way as the previous. The parameters are set in such a way as to extract the partials as accurately as possible(to prevent them from appearing in the residual)

Since we assume the residual to be a stochastic signal, it is characterized by its amplitude and frequency content alone; its exact phase does not matter.  
To obtain the general shape of the residual spectrum, we approximate the envelope of the residual magnitude spectrum, which is obtained by subtracting the deterministic magnitude spectrum from the original one. This is because only the shape of the envelope contributes to the sound characteristics. The envelope is approximated by **curve fitting** or **LPC**.  
Once the envelope is obtained, we generate the stochastic signal by using the envelope as the magnitude spectrum and random numbers as the phase spectrum
\begin{align}
\hat e(t) = IFT(A(k)e^{j \Theta(k)})
\end{align}
Here, A(k) is the envelope, and $\Theta(k)$ is the phase(random)
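
A sketch of both steps for one frame: a crude envelope fit followed by an inverse FFT with random phases. The piecewise-linear fit through per-band maxima is just one simple form of curve fitting, picked here as an assumption; the names `fit_envelope` and `synth_stochastic`, the 16 bands and the decaying test spectrum are likewise hypothetical.

```python
import numpy as np

def fit_envelope(res_mag, n_bands):
    """Piecewise-linear envelope through the maximum of each of n_bands equal-width bands."""
    edges = np.linspace(0, len(res_mag), n_bands + 1).astype(int)
    centers = 0.5 * (edges[:-1] + edges[1:] - 1)
    heights = np.array([res_mag[a:b].max() for a, b in zip(edges[:-1], edges[1:])])
    return np.interp(np.arange(len(res_mag)), centers, heights)

def synth_stochastic(env, rng):
    """One noise frame: inverse FFT of the envelope A(k) with random phases Theta(k)."""
    phase = rng.uniform(-np.pi, np.pi, len(env))
    phase[0] = phase[-1] = 0.0      # DC and Nyquist bins must be real for a real signal
    return np.fft.irfft(env * np.exp(1j * phase))

# Assumed residual magnitude spectrum (129 bins, i.e. a 256-sample frame).
res_mag = np.exp(-np.arange(129) / 40.0)
env = fit_envelope(res_mag, 16)
frame = synth_stochastic(env, np.random.default_rng(0))
```

Each call with fresh random phases gives a different waveform with the same magnitude spectrum. Envelope shaping or filtering of the stochastic part then amounts to editing `env` before synthesis.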

Transformations - Can be separately applied to the deterministic and stochastic components. 
- Deterministic - The same kinds of transformations as before
- Stochastic - Envelope shaping, filtering etc. 

![fig](fig_8.png)