# :: 16th December 2022 :: @7:25pm

Short-Time Fourier Transform
- Enables the extraction of spectrograms
    - Most useful feature for deep learning audio models.
- We will move from the frequency domain to a time-frequency representation.
- This is useful because the standard Fourier transform averages the frequency content over the whole signal: it tells us which frequencies are present but not when they occur. It is more helpful to know how the audio signal evolves across time.
- To get this, we apply the FFT locally, to small segments (frames) of the signal.

STFT Intuition:
- Apply FFT to the current frame to get magnitude spectrum.
- Slide to the next frame and repeat until the whole duration of the signal is covered.
- Derive these frames by windowing the signal.
$$
x_w(k) = x(k) \times w(k)
$$
- where w(k) is the k-th sample of the window function and x_w(k) is the windowed signal
- The window function is zero outside the current segment, so multiplying the whole original signal by it leaves only the segment we want and nothing else.


- Window size and frame size are both measured in samples.
    - Window size is the number of samples we apply the window function to.
    - Frame size is the number of samples we consider in each chunk of the signal when computing the STFT. Usually window size = frame size.
    - Sometimes the frame size is larger than the window size, but usually the two coincide.
    - E.g. in librosa the default window size is the frame size.
    - If the window is smaller than the frame, we zero-pad the windowed samples up to the frame size.
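The zero-padding case can be sketched in NumPy as follows. The sizes and the `segment` values are hypothetical, and padding at the end of the frame is an assumption made for simplicity; some libraries centre the window inside the frame instead.

```python
import numpy as np

frame_size = 8    # number of samples passed to the FFT (hypothetical)
window_size = 4   # number of samples the window covers (hypothetical)

# A short stretch of signal, tapered by a Hann window of length window_size.
segment = np.array([1.0, 2.0, 3.0, 4.0]) * np.hanning(window_size)

# Zero-pad the windowed samples up to the frame size, then transform.
padded = np.zeros(frame_size)
padded[:window_size] = segment
spectrum = np.fft.rfft(padded)

# np.fft.rfft can do the same end-padding itself through its n argument.
spectrum_auto = np.fft.rfft(segment, n=frame_size)
```

Both results have `frame_size // 2 + 1` bins, so the padding sets the number of frequency bins while the window still limits how much signal contributes to each frame.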

- The frames usually overlap, so we use a hop size (H)
    - It sets how many samples we slide to the right when we take a new frame.
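A minimal sketch of framing with a hop size, assuming NumPy and a toy signal. The helper name `frame_signal` is made up for illustration and is not a library function.

```python
import numpy as np

def frame_signal(x, frame_size, hop_size):
    """Slice x into overlapping frames; returns an array (n_frames, frame_size)."""
    n_frames = 1 + (len(x) - frame_size) // hop_size
    return np.stack(
        [x[m * hop_size : m * hop_size + frame_size] for m in range(n_frames)]
    )

x = np.arange(10, dtype=float)           # toy signal: 0, 1, ..., 9
frames = frame_signal(x, frame_size=4, hop_size=2)
# Frames start at samples 0, 2, 4 and 6, so neighbours share 2 samples.

window = np.hanning(4)                   # Hann window of the same length
windowed = frames * window               # x_w(k) = x(k) * w(k), frame by frame
```

With a hop size of half the frame size, every sample (apart from those at the very edges) falls into two frames.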

From DFT to STFT:
$$
\hat{x}(k) = \sum_{n=0}^{N-1} x(n) \times e^{-i2\pi n \frac{k}{N}}
$$
- where k is a proxy for frequency (the index of the frequency bin)
- N = number of samples in the whole signal
- x(n) = the n-th sample of the whole signal
$$
S(m,k) = \sum_{n=0}^{N-1} x(n+mH) \times w(n) \times e^{-i2\pi n \frac{k}{N}}
$$
- S depends on m (proxy for time) and k (proxy for frequency)
- m is the number of the frame we are currently in.
    - eg m = 2 on frame 2.
- Both sums run over N samples, but N means different things: in the DFT, N is the number of samples in the whole signal, while in the STFT, N is the frame size, so each sum covers only the current frame.
- x(n + mH): mH is the starting sample of the current frame, where H is the hop size and m is the frame number. n runs from 0 to N-1 and covers all samples in the frame.
- The signal is multiplied by the window function w(n), which cuts away all of the signal except the part inside the current frame.
- The last factor is the same in both: we multiply by a pure tone of frequency k and sum, which projects the signal onto that tone.
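The STFT sum can be evaluated term by term to check the reading above. This is a deliberately slow sketch, assuming NumPy; `stft_naive` is a hypothetical helper, and it indexes its result as `S[m, k]` (frames first), which is the transpose of the (frequency bins, frames) layout that libraries usually return.

```python
import numpy as np

def stft_naive(x, frame_size, hop_size, window):
    """Direct evaluation of S(m, k) = sum_n x(n + mH) w(n) exp(-i 2 pi n k / N)."""
    N, H = frame_size, hop_size
    n_frames = 1 + (len(x) - N) // H
    n = np.arange(N)
    k = n.reshape(-1, 1)
    tones = np.exp(-2j * np.pi * k * n / N)   # tones[k, n]: one pure tone per row
    S = np.empty((n_frames, N), dtype=complex)
    for m in range(n_frames):
        frame = x[m * H : m * H + N] * window  # x(n + mH) * w(n)
        S[m] = tones @ frame                   # project the frame onto every tone
    return S

rng = np.random.default_rng(0)
x = rng.standard_normal(64)                    # toy signal
w = np.hanning(16)
S = stft_naive(x, frame_size=16, hop_size=8, window=w)

# Each row should match the FFT of the corresponding windowed frame.
reference = np.fft.fft(x[8:24] * w)            # frame m = 1 starts at sample mH = 8
```

In practice the inner product is replaced by an FFT call per frame, which gives the same numbers much faster.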


Outputs:
- DFT
    - We extract a spectral vector: one Fourier coefficient for each of the frequency components we decomposed the original signal into. This is a one-dimensional array (# frequency bins) with no time information.
- STFT
    - We have a two dimensional array, spectral matrix with (# frequency bins, # frames), where the frames are proxies for time
    - Complex fourier coefficients.

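- #Frequency bins (in the DFT, the number of bins equals the number of samples in the whole signal; in the STFT, the bins are computed per frame and only those up to the Nyquist frequency are kept):
    - $\frac{framesize}{2}+1$
        - For a real-valued signal the spectrum is mirrored around the Nyquist frequency, so the upper half of the bins is redundant; hence the division by 2.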
- #Frames:
    - $\frac{samples - framesize}{hopsize} + 1$

Example STFT output:
 - Signal = 10k samples
 - Frame size = 1000
 - Hop size = 500

- #Freq bins = 1000 / 2 + 1 = 501 freq bins
    - The freq range is (0 Hz, sampling rate / 2)
- #Frames = (10000 - 1000) / 500 + 1 = 19 frames

Thus the STFT output is a 2-dimensional array of shape (501, 19) -> (frequency, time)
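The example shape can be checked numerically. The sketch below assumes NumPy, a random stand-in signal, a Hann window, and no padding at the signal edges; libraries that pad the edges (librosa's default `center=True`, for instance) return more frames than this.

```python
import numpy as np

samples, frame_size, hop_size = 10_000, 1_000, 500
x = np.random.default_rng(0).standard_normal(samples)   # stand-in signal

n_frames = (samples - frame_size) // hop_size + 1
window = np.hanning(frame_size)

# rfft keeps only the frame_size // 2 + 1 non-redundant bins of a real frame.
S = np.stack(
    [
        np.fft.rfft(x[m * hop_size : m * hop_size + frame_size] * window)
        for m in range(n_frames)
    ],
    axis=1,
)
print(S.shape)  # (501, 19)
```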


STFT Params:

Parameter #1 = Frame Size
    - Usually measured in samples (512,1024,2048...)
        - Powers of 2 are preferred because the FFT is most efficient for those sizes.

Time / Frequency trade off
- Large frame size = higher freq resolution = lower time resolution
    - More freq bins = finer freq resolution
    - More samples per frame span a longer stretch of time, hence lower time resolution

- Smaller frame size = lower freq resolution = higher time resolution
    - Less time is covered per frame: each Fourier transform is calculated on a smaller chunk of time.

- There is no single right answer: we pick the frame size heuristically, trading one resolution for the other depending on the problem being solved.
- Onset detection favours time resolution
- De-noising favours freq resolution (plus some time resolution)
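The trade-off can be put in numbers. The sketch below is plain Python; the sampling rate of 22050 Hz is an assumed value, and `resolution` is a hypothetical helper.

```python
def resolution(frame_size, sample_rate):
    """Return (bin spacing in Hz, frame duration in seconds)."""
    freq_res_hz = sample_rate / frame_size   # distance between adjacent bins
    time_res_s = frame_size / sample_rate    # stretch of time one frame covers
    return freq_res_hz, time_res_s

sample_rate = 22_050  # assumed sampling rate in Hz
for frame_size in (512, 1024, 2048):
    df, dt = resolution(frame_size, sample_rate)
    print(f"frame={frame_size:5d}  bin spacing={df:6.2f} Hz  frame length={dt * 1000:6.2f} ms")
```

The product of the two values is always 1, so halving the bin spacing doubles the frame duration: one resolution cannot be improved without giving up some of the other.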


Parameter #2 = HOP SIZE:
- Number of samples we slide to the right to get the next frame
- Usually a power of two (256, 512, 1024...)
- Can be defined as a fraction of the frame size K, eg K/2, K/4, K/8


Parameter #3 = Windowing function:
- STFT is not only a function of the signal itself but also of the windowing function that we choose. Different windowing functions will modulate the signal differently.
- The rectangular window is rarely used, as it creates discontinuities at the edges of each frame, which show up as spectral leakage.
- Usually use a bell shaped curve
    - Hann window


Hann Window equation:
$$
w(k) = 0.5 \times \left(1 - \cos\left(\frac{2\pi k}{K - 1}\right)\right), \quad k = 0 \ldots K-1
$$
Multiplying the frame, sample by sample, by these weights tapers the signal towards zero at the frame edges.
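A small sketch of the Hann window in NumPy, indexing the samples from 0 to K-1 so that the window starts and ends at zero. The helper name `hann` is made up; `np.hanning` is NumPy's built-in symmetric version and is used here only as a cross-check.

```python
import numpy as np

def hann(K):
    """Symmetric Hann window: w(k) = 0.5 * (1 - cos(2 pi k / (K - 1)))."""
    k = np.arange(K)                     # k = 0, 1, ..., K - 1
    return 0.5 * (1 - np.cos(2 * np.pi * k / (K - 1)))

w = hann(8)
matches_numpy = np.allclose(w, np.hanning(8))

# Tapering a constant frame: the edges are pulled to zero, the middle is kept.
tapered = np.ones(8) * w
```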

Visualising Sound:
$$
Y(m,k) = |S(m,k)|^2
$$
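As a sketch, the squared magnitude is a single NumPy expression. The tiny matrix `S` below is a hand-made stand-in for a real STFT output, and the decibel conversion with a small `eps` offset is an added assumption (a common choice for plotting), not something taken from the formula above.

```python
import numpy as np

def power_spectrogram(S):
    """Y(m, k) = |S(m, k)|^2 for a complex STFT matrix S."""
    return np.abs(S) ** 2

# Hand-made complex coefficients standing in for an STFT output.
S = np.array([[3 + 4j, 1j], [0, 2]])
Y = power_spectrogram(S)             # [[25, 1], [0, 4]]

# Assumed extra step: log scaling makes quiet components visible in a plot.
eps = 1e-10                          # avoids log10(0)
Y_db = 10 * np.log10(Y + eps)
```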
