# Digital Signal Processing and Deep-Learning

--------------------------------------------------------------------------------------

### Basics

- DSP/Audio Features for ML: [Valerio Velardo](https://www.youtube.com/watch?v=rlypsap6Wow&list=PL-wATfeyAMNqIee7cH3q1bh4QJFAaeNv0&index=8)
- [DSP: Basics](https://support.ircam.fr/docs/AudioSculpt/3.0/co/Sampling.html) - aliasing, Nyquist frequency, lowest detectable frequency
- If a signal is sampled at a 32 kHz sampling rate (SR), any frequency component above 16 kHz (the Nyquist frequency) is aliased
- Nyquist frequency and the relation between sampling rate and max. frequency
    - $F_{max} = \text{Sampling Rate}/2$
    - $F_{max}$ is called the _Nyquist frequency_
- Length (size) of the NumPy array holding a .wav clip = SR * duration of clip (in seconds)
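
A minimal sketch of the two points above, using only NumPy. The 1-second duration and the 20 kHz test tone are assumptions chosen for illustration; the 32 kHz sampling rate comes from the note. The tone lies above the Nyquist frequency, so it shows up at a lower (aliased) frequency.

```python
import numpy as np

sr = 32_000          # sampling rate in Hz, as in the note above
duration = 1.0       # clip length in seconds (assumed for illustration)
nyquist = sr / 2     # 16 kHz

n_samples = int(sr * duration)      # array length = SR * duration
t = np.arange(n_samples) / sr       # sample times in seconds

tone_hz = 20_000                    # above Nyquist, so it cannot be represented
x = np.sin(2 * np.pi * tone_hz * t)

# Locate the strongest frequency in the sampled signal
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(n_samples, d=1 / sr)
peak_hz = freqs[np.argmax(spectrum)]

print(n_samples)   # 32000
print(peak_hz)     # 12000.0, i.e. sr - tone_hz
```

The 20 kHz tone is indistinguishable from a 12 kHz tone once sampled, which is why audio is usually low-pass filtered below the Nyquist frequency before sampling.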

**Sound perception**: [link](http://physics.bu.edu/~duffy/py105/Sound.html), [link](https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1740-9713.2013.00636.x)
- We perceive sound logarithmically
- **Weber-Fechner law**: Above a minimal threshold of perception $S_0$, perceived intensity $P$ is proportional to the logarithm of stimulus intensity $S$: $P = K \cdot \log\left(\frac{S}{S_0}\right)$
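
A small sketch of what the logarithmic law implies: multiplying the stimulus by a fixed factor always adds the same perceptual step. The function name `perceived_intensity`, the default values `S0 = 1` and `K = 1`, and the use of the natural logarithm are assumptions for illustration; the note does not fix a log base.

```python
import math

def perceived_intensity(S, S0=1.0, K=1.0):
    """Weber-Fechner law: P = K * log(S / S0), defined for S >= S0."""
    if S < S0:
        raise ValueError("stimulus is below the perception threshold S0")
    return K * math.log(S / S0)

# Doubling the stimulus adds the same perceptual step, K * log(2),
# regardless of the starting intensity.
steps = [
    perceived_intensity(2 * S) - perceived_intensity(S)
    for S in (1.0, 10.0, 100.0)
]
print(steps)  # three nearly identical values
```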

### Time domain features

- Amplitude Envelope
- RMS Energy
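
Both features are computed per frame. The sketch below assumes the usual definitions (amplitude envelope = maximum absolute sample in a frame, RMS energy = square root of the mean squared sample in a frame). The helper names and the frame/hop sizes of 1024/512 are illustrative choices, not values from these notes.

```python
import numpy as np

def frame_signal(x, frame_length, hop_length):
    """Slice a 1D signal into overlapping frames, shape (n_frames, frame_length)."""
    n_frames = 1 + (len(x) - frame_length) // hop_length
    return np.stack([
        x[i * hop_length : i * hop_length + frame_length]
        for i in range(n_frames)
    ])

def amplitude_envelope(x, frame_length=1024, hop_length=512):
    """Maximum absolute amplitude in each frame."""
    return np.abs(frame_signal(x, frame_length, hop_length)).max(axis=1)

def rms_energy(x, frame_length=1024, hop_length=512):
    """Root-mean-square of the samples in each frame."""
    frames = frame_signal(x, frame_length, hop_length)
    return np.sqrt(np.mean(frames ** 2, axis=1))

# Example: a 440 Hz sine with amplitude 0.5, one second at 22050 Hz
sr = 22050
t = np.arange(sr) / sr
x = 0.5 * np.sin(2 * np.pi * 440 * t)

ae = amplitude_envelope(x)
rms = rms_energy(x)
print(ae.shape, rms.shape)
```

For a steady sine of amplitude 0.5, the envelope stays close to 0.5 and the RMS close to 0.5/√2, which makes RMS less sensitive to single outlier samples than the envelope.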


### FFT

- $s(t) = A_1 \sin(2\pi f_1 t + \phi_1) + A_2 \sin(2\pi f_2 t + \phi_2) + \ldots$
- FFT: Y-axis: Magnitude, X-Axis: Freq.
- Plotting a sine wave using freqs. [link](https://stackoverflow.com/questions/22566692/python-how-to-plot-graph-sine-wave)
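- When plotted, the full FFT magnitude of a real signal shows a reflection around the Nyquist frequency = SR/2.
- The reflected half mirrors the lower half and adds no new information. It is related to aliasing: once sampled, a component above SR/2 is indistinguishable from its mirror image below SR/2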
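
A sketch of the points above with NumPy: build a signal as a sum of two sinusoids, take the FFT, and check both the peak locations and the mirror symmetry around the Nyquist frequency. The amplitudes, frequencies (50 Hz and 120 Hz), phase, and the 1000 Hz sampling rate are arbitrary values assumed for the example.

```python
import numpy as np

sr = 1000                       # sampling rate in Hz (assumed)
t = np.arange(sr) / sr          # one second of sample times

# s(t) = A1 sin(2*pi*f1*t + phi1) + A2 sin(2*pi*f2*t + phi2)
s = 1.0 * np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t + np.pi / 4)

magnitude = np.abs(np.fft.fft(s))           # Y-axis: magnitude
freqs = np.arange(len(s)) * sr / len(s)     # X-axis: bin k maps to k * sr / N Hz

# The two strongest bins below the Nyquist frequency are the two sinusoids
half = len(s) // 2
top_two = np.sort(freqs[np.argsort(magnitude[:half])[-2:]])
print(top_two)   # [ 50. 120.]

# Bins above Nyquist mirror the bins below it: |X[N - k]| == |X[k]|
mirrored = np.allclose(magnitude[1:half], magnitude[-1:half:-1])
print(mirrored)  # True
```

Because of this symmetry, plots usually show only the bins from 0 to SR/2; `np.fft.rfft` returns just that half.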

### STFT

- Compute the FFT at several intervals (frames)
- Interval = frame length
- Preserves time information even though it is a frequency transformation
- STFT gives a **spectrogram**
- 3 axes
- Y: Freq., X: Time, Color variation: Magnitude
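
A minimal NumPy sketch of the idea: slice the signal into frames, take the FFT of each frame, and stack the magnitudes into a 2D array with frequency on one axis and time on the other. The Hann window, the overlapping hop of 512 samples, and the function name `stft_magnitude` are assumptions for illustration; a real project would more likely call an existing STFT routine.

```python
import numpy as np

def stft_magnitude(x, frame_length=1024, hop_length=512):
    """Magnitude spectrogram, shape (n_freq_bins, n_frames)."""
    window = np.hanning(frame_length)   # taper each frame to reduce leakage
    n_frames = 1 + (len(x) - frame_length) // hop_length
    frames = np.stack([
        x[i * hop_length : i * hop_length + frame_length] * window
        for i in range(n_frames)
    ])
    # One FFT per frame; transpose so rows = frequency, columns = time
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Test signal: 440 Hz for the first second, 1760 Hz for the second
sr = 8000
t = np.arange(2 * sr) / sr
x = np.where(t < 1.0, np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 1760 * t))

spec = stft_magnitude(x)
freqs = np.fft.rfftfreq(1024, d=1 / sr)

first_peak = freqs[np.argmax(spec[:, 0])]    # dominant freq. in the first frame
last_peak = freqs[np.argmax(spec[:, -1])]    # dominant freq. in the last frame
print(spec.shape)   # (513, 30)
print(first_peak, last_peak)
```

A single FFT over the whole clip would show both tones but not when each one occurs; the per-frame peaks recover that timing, to within the frequency resolution of SR / frame length.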


### Mel Frequency Cepstral Coefficients (MFCC)

- Timbral/textural features of sound
- _Frequency_ domain feature
- Approximates the _human_ auditory system
- 13-40 coeffs.
- Calculated at each _frame_
- Timbre is what lets humans tell a piano from a violin playing the same pitch/freq.
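
A rough sketch of one common way MFCCs are computed per frame: power spectrum, triangular mel filterbank, log, then a DCT. None of these steps are spelled out in the notes above, so treat all of it as an assumption: the mel formula `2595 * log10(1 + f / 700)`, the 40 mel bands, the unnormalized DCT-II, and every function name here are illustrative. Library implementations differ in normalization and framing details.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale, shape (n_mels, n_fft//2 + 1)."""
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    bank = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def mfcc_sketch(signal, sr, n_mfcc=13, n_fft=2048, hop=512, n_mels=40):
    """Return an array of shape (n_frames, n_mfcc)."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2           # (n_frames, n_fft//2 + 1)
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sr).T   # (n_frames, n_mels)
    log_mel = np.log(mel_energy + 1e-10)                       # small offset avoids log(0)
    # Unnormalized DCT-II over the mel axis; keep the first n_mfcc coefficients
    n = np.arange(n_mels)
    k = np.arange(n_mfcc)[:, None]
    basis = np.cos(np.pi / n_mels * (n + 0.5) * k)             # (n_mfcc, n_mels)
    return log_mel @ basis.T

sr = 22050
rng = np.random.default_rng(0)
coeffs = mfcc_sketch(rng.standard_normal(sr), sr)   # one second of noise
print(coeffs.shape)   # (40, 13)
```

The mel filterbank is where the approximation of human hearing enters: filters are narrow at low frequencies and wide at high frequencies.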

# CNN

1. Convolution = dot product of the kernel with the image patch under it
2. The center square of the patch is replaced with that value
3. Zero padding is used to handle the edge values
4. Kernels are **learnt**
5. Kernels help detect features - e.g. edges
6. KERNEL - Architectural decisions:
    - Grid size: Kernel size, e.g. 3x3, 5x5. Note the odd size, so that the kernel has a center.
    - Stride:    Kernel step size in pixels
    - Depth:     Mono = 1, RGB = 3
    - Number of kernels: one 2D output array per kernel
7. Conv. layer output:
    - Conv. layer has **multiple** kernels
    - Each kernel outputs a 2D array (sliding the kernel's dot product over the image gives a 2D output)
    - Output from a layer = as many 2D arrays as kernels used
8. Pooling output:
    - Downsample the image
    - Max./avg. pooling
    - No parameters, unlike a kernel: pooling applies a fixed max. or avg., so there is nothing to learn
9. POOLING - Architectural decisions:
    - Grid size: 2x2, 3x3 etc.
    - Stride
    - Type, i.e. max. or avg. - _Generally max. is used. Helps with invariance to small shifts_
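
The convolution and pooling steps above can be sketched in plain NumPy. This is a single-channel illustration with a hand-picked vertical-edge kernel; in a real CNN the kernel values are learnt, and frameworks use far faster implementations. The function names `conv2d` and `max_pool2d` are hypothetical helpers, not a library API. As in most CNN libraries, the kernel is not flipped, so strictly speaking this is cross-correlation.

```python
import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    """Slide the kernel over the image; each output value is a dot product."""
    image = np.pad(image, pad)                      # zero padding on all sides
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride : i * stride + kh, j * stride : j * stride + kw]
            out[i, j] = np.sum(patch * kernel)      # element-wise product, then sum
    return out

def max_pool2d(fmap, size=2, stride=2):
    """Downsample by keeping the maximum of each window; nothing is learnt."""
    out_h = (fmap.shape[0] - size) // stride + 1
    out_w = (fmap.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = fmap[i * stride : i * stride + size,
                             j * stride : j * stride + size].max()
    return out

# 6x6 image: dark left half, bright right half (a vertical edge)
image = np.zeros((6, 6))
image[:, 3:] = 1.0
kernel = np.array([[-1, 0, 1]] * 3, dtype=float)    # vertical-edge detector

fmap = conv2d(image, kernel)            # no padding: 6x6 -> 4x4
pooled = max_pool2d(fmap)               # 4x4 -> 2x2
same = conv2d(image, kernel, pad=1)     # zero padding keeps the 6x6 size

print(fmap[0])        # [0. 3. 3. 0.]
print(pooled.shape)   # (2, 2)
print(same.shape)     # (6, 6)
```

The feature map is non-zero only where the kernel straddles the edge, which is the sense in which a kernel "detects" a feature.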


10. **Architecture:**
```
    Input image -> {(Conv.layer+ReLU)->Pooling} -> {(Conv.layer+ReLU)->Pooling} ->  ... 
                  \__________________ FEATURE LEARNING _____________________________________/
                  
     ... Flatten output -> Fully connected layers ->  Softmax -> Classification output 
        \__________________ CLASSIFICATION  _________________/ 
```    
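
To see how the kernel size, stride, padding, and number of kernels shape this pipeline, the sketch below traces array shapes through it without any deep-learning framework. It relies on the standard output-size formula `floor((in + 2 * padding - kernel) / stride) + 1`. The 28x28x1 input, the block sizes, the hidden layer width, and the function names are all assumptions for illustration.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output length along one axis for a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

def trace_shapes(height, width, depth, blocks, hidden, n_classes):
    """blocks is a list of (number_of_kernels, kernel_size) pairs."""
    shapes = [("input", (height, width, depth))]
    for n_kernels, k in blocks:
        # Conv. layer with stride 1 and zero padding of k // 2 keeps height/width
        height = conv_output_size(height, k, stride=1, padding=k // 2)
        width = conv_output_size(width, k, stride=1, padding=k // 2)
        depth = n_kernels                       # one 2D output array per kernel
        shapes.append(("conv+relu", (height, width, depth)))
        # 2x2 pooling with stride 2 halves height and width
        height = conv_output_size(height, 2, stride=2)
        width = conv_output_size(width, 2, stride=2)
        shapes.append(("pool", (height, width, depth)))
    shapes.append(("flatten", (height * width * depth,)))
    shapes.append(("fully connected", (hidden,)))
    shapes.append(("softmax", (n_classes,)))
    return shapes

shapes = trace_shapes(28, 28, 1, blocks=[(32, 3), (64, 3)], hidden=128, n_classes=10)
for name, shape in shapes:
    print(f"{name:16s} {shape}")
```

With these assumed sizes the trace ends in a flattened vector of 7 * 7 * 64 = 3136 values feeding the fully connected layers.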

11. First layers = low-level features (lines, circles). Deeper layers = more abstract features (wheel, headlights)

12. Data shape for MFCCs:

    Example:
    - 13 features (MFCCs)
    - hop length = 512
    - num. of samples in audio file = 51200

    Data shape = 100 x 13 x 1
    - Number of time-windows = full .wav length / hop length = 51200/512 = 100
    - Features extracted per time-window = 13
    - Depth = 1

    Therefore: Data shape = {(full track length in no. of samples) / hop length} x {no. of MFCC features} x {depth of image: mono = 1, RGB = 3}
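
The shape arithmetic above can be checked with a placeholder array. The zeros stand in for real MFCC values, and the extra leading batch axis is an assumption about how a CNN framework would typically expect its input. Note that some feature extractors pad the signal at its edges and may return one more frame than this simple division.

```python
import numpy as np

n_samples = 51200     # samples in the audio file
hop_length = 512
n_mfcc = 13           # features per time-window

n_frames = n_samples // hop_length          # 51200 / 512 = 100 time-windows
mfcc = np.zeros((n_frames, n_mfcc))         # placeholder for real MFCC values

# Add the depth axis: MFCCs behave like a single-channel (mono) image
cnn_input = mfcc[..., np.newaxis]
print(cnn_input.shape)   # (100, 13, 1)

# Assumed framework convention: a leading batch axis for one example
batch = cnn_input[np.newaxis, ...]
print(batch.shape)       # (1, 100, 13, 1)
```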