# Interactive Spectrograms

In [2]:
# Do the imports #
##################
#
%matplotlib inline
import os,sys 
import numpy as np
import pandas as pd
from IPython.display import display, Audio, HTML
#
try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False

# install pyspch only when it is not yet available
try:
  import pyspch
except ImportError:
  ! pip install git+https://github.com/compi1234/pyspch.git
import pyspch.spg as specg
import pyspch.audio as audio
import pyspch.utils as spchu
import pyspch.io.timit as tio
import matplotlib.pyplot as plt
import pyspch.display as spchd
import pyspch.interactive

# make notebook cells stretch over the full screen
display(HTML(data="""
<style>
    div#notebook-container    { width: 95%; }
    div#menubar-container     { width: 65%; }
    div#maintoolbar-container { width: 99%; }
</style>
"""))

### Purpose and Background
The interactive spectrogram visualizes speech in the time-frequency domain.
Some form of time-frequency analysis is the first processing step in the human auditory system and equally so
in speech recognition systems.

Possible spectral representations are:   
**1. Fourier spectrogram**  
  A Fourier spectrogram is obtained by sliding a window over the signal and computing a short-time spectrum at each window position. Displayed as a 2D heatmap,
it shows which frequencies are present at which moment in time.  
**2. Mel spectrogram**  
  The mel spectrogram warps the frequency axis in line with the human auditory system.
Roughly speaking, the frequency axis is linear below 1 kHz and logarithmically compressed above it.
Today this is the most popular feature representation for speech recognition.  
**3. MFCCs (mel frequency cepstral coefficients)**  
  Mel frequency cepstral coefficients are obtained by applying a discrete cosine transform (DCT) to the log mel spectrum, optionally followed by truncation
to a handful of coefficients. 
MFCCs are popular because almost all information is concentrated in a handful of low-order coefficients, making them a very
compact speech representation.  Moreover, MFCCs are largely decorrelated, making them well suited for mathematical modeling.
While MFCCs have little to offer when abundant data and compute power are available (as is common these days),
they are still of interest in compact systems. 
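
The three representations above form a processing chain, which can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the pyspch implementation: the helper names (`fourier_spectrogram`, `mel_filterbank`, `dct2`), the Hamming window, the triangular filters and the mel formula `2595*log10(1 + f/700)` are all choices made for this sketch, and pyspch may differ on each of them.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel formula (assumption; other variants exist)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def fourier_spectrogram(x, sr, length=0.030, shift=0.010, preemp=0.97):
    """Power spectrogram with shape (n_freq, n_frames)."""
    x = np.append(x[0], x[1:] - preemp * x[:-1])            # pre-emphasis
    n_len, n_shift = int(round(length * sr)), int(round(shift * sr))
    n_frames = 1 + (len(x) - n_len) // n_shift              # no padding at the edges
    idx = np.arange(n_len)[None, :] + n_shift * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(n_len)                     # sliding window
    n_fft = 1 << (n_len - 1).bit_length()                   # next power of two
    return (np.abs(np.fft.rfft(frames, n_fft, axis=1)) ** 2).T

def mel_filterbank(n_mels, n_freq, sr):
    """Triangular filters equally spaced on the mel axis, shape (n_mels, n_freq)."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    freqs = np.linspace(0.0, sr / 2, n_freq)
    fb = np.zeros((n_mels, n_freq))
    for i in range(n_mels):
        lo, mid, hi = edges_hz[i:i + 3]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def dct2(n_out, n_in):
    """Unnormalized DCT-II basis, shape (n_out, n_in)."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    return np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1000 * t)                 # synthetic 1 s, 1 kHz test tone

spg = fourier_spectrogram(x, sr)                 # 1. Fourier spectrogram
mel = mel_filterbank(24, spg.shape[0], sr) @ spg # 2. mel spectrogram
mfcc = dct2(13, 24) @ np.log(mel + 1e-10)        # 3. MFCCs (truncated to 13)

peak_hz = np.argmax(spg.mean(axis=1)) * (sr / 2) / (spg.shape[0] - 1)
print(spg.shape, mel.shape, mfcc.shape, peak_hz)
```

The test tone makes it easy to check the frequency axis: the strongest spectrogram bin should sit at the tone frequency.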
 


### Instructions
- In default mode, you start the interactive spectrogram by calling it without any parameters
> iSpectrogram()

You may need to call the iSpectrogram routines with different parameters that better suit your display:
- size is a percentage of the maximum display size that is possible with your current notebook setup
- the dpi parameter controls the granularity of the plot and, to some extent, the size of the plot relative to the controls


#### File Input
Suggested files to choose from (relative to 'https://homes.esat.kuleuven.be/~spchlab/data/'):
- misc/friendly.wav  ... a 1 second speech fragment
- misc/train.wav     ... a train whistle
- timit/si1027.wav   ... an example sentence from the TIMIT corpus

#### Segmentations
For the example speech files a number of segmentations are available (not all for each example). You can display them by entering the filename in the appropriate field.
They differ only in their extensions: ".gra" for grapheme or letter,
".phn" for phone, ".syl" for syllable and ".wrd" for word.
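
If you want to inspect a segmentation outside the interactive tool, the sketch below parses TIMIT-style segmentation text, where each line holds `start end label` with times given in samples. That the example files follow exactly this layout is an assumption, `read_segmentation` is a hypothetical helper written for this illustration, and the three example lines are made up rather than taken from `misc/friendly.phn`. In the notebook itself, the readers that ship with pyspch are the better choice.

```python
import io

def read_segmentation(lines, sr=16000):
    """Parse 'start end label' lines (times in samples) into (t0, t1, label) in seconds."""
    segments = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:          # skip empty or malformed lines
            continue
        start, end = int(parts[0]), int(parts[1])
        label = " ".join(parts[2:])
        segments.append((start / sr, end / sr, label))
    return segments

# Made-up example content, for illustration only
example = io.StringIO("0 1600 h#\n1600 3200 f\n3200 4000 r\n")
segs = read_segmentation(example)
for t0, t1, label in segs:
    print(f"{t0:6.3f} {t1:6.3f} {label}")
```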

#### Visualization details
Normally you shouldn't have to worry about these settings.  On most displays, visualization will be fine for screen/window sizes on the order of 10-24 inches.  If you observe a bad mismatch on your display between character sizes in the UI and
in the figures, you can try to modify the default settings.   
If sliders don't align well with the plots,
you may also need to adjust the size of your window.   
In all cases you can change the figure width (default = 12 inches) in the call to iSpectrogram 
> iSpectrogram(figwidth=14, dpi=120)

### Exercise 1: Phonetic Segmentations

1. setting up:
    + work with interactive.iSpg1()
    + load misc/friendly.wav and load also the phonetic segmentation in misc/friendly.phn
    + set your audio at a comfortable loudness when you play the sentence
2. focus on the first word 'friendly', segment, listen and comment
    + 'f-r-ih-n-d-l-iy'
    + 'ih-n-d-l-iy' 
    + 'f' and 'f-r'
    + what was your most striking observation
    + to what extent do you agree with the given segmentation, based on perception, based on time waveform and based on spectrogram ?

### Exercise 2: Spectrogram Parameters

1. setting up:
    + work with interactive.iSpg1()
    + load again misc/friendly.wav with its phonetic transcript (or some other speech wavfile)
2. adjust different spectrogram settings, always start from defaults (shift=10msec, length=30msec, preemphasis=.97)
    + describe what you observed when deviating from the defaults
    + for what parameters and in what way does the spectrogram deviate from speech perception ?
    + choose as frame_length 10, 30, 50, 100 msec. Which values would you describe as good, acceptable, not acceptable and why ?
    + choose as frame_shift 5,10, 20 msec. Which values would you describe as good, acceptable, not acceptable and why ?
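
To support the comparison of frame lengths and shifts, the sketch below converts the settings from the exercise into samples and frame counts. It assumes a 16 kHz sampling rate and a 1 second signal, counts frames without padding at the edges (the tool may count differently), and uses 1/frame_length only as a rough rule of thumb for frequency resolution, since the exact value depends on the window shape. `frame_stats` is a hypothetical helper, not part of pyspch.

```python
SR = 16000        # sampling rate in Hz (assumed)
DURATION = 1.0    # signal duration in seconds (assumed)

def frame_stats(length_ms, shift_ms, sr=SR, duration=DURATION):
    """Return (samples per frame, samples per shift, number of frames,
    rough frequency resolution in Hz, fractional overlap between frames)."""
    n_len = round(length_ms * sr / 1000)
    n_shift = round(shift_ms * sr / 1000)
    n_samples = round(duration * sr)
    n_frames = 1 + max(0, n_samples - n_len) // n_shift
    resolution_hz = 1000.0 / length_ms      # rule of thumb: about 1/T
    overlap = 1.0 - shift_ms / length_ms    # negative means gaps between frames
    return n_len, n_shift, n_frames, resolution_hz, overlap

print("length  shift  n_len  n_shift  n_frames  resol(Hz)  overlap")
for length_ms in (10, 30, 50, 100):
    for shift_ms in (5, 10, 20):
        n_len, n_shift, n_frames, res, ov = frame_stats(length_ms, shift_ms)
        print(f"{length_ms:6d} {shift_ms:6d} {n_len:6d} {n_shift:8d} "
              f"{n_frames:9d} {res:10.1f} {ov:8.2f}")
```

Comparing the resolution column with the harmonic spacing of a voice (the pitch) and the overlap column with zero is a useful starting point for the discussion.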




In [4]:
pyspch.interactive.Spg1(figwidth=12)                 


### Exercise 3: Spectrogram: pitch and formants


1. setting up:
    + work with interactive.iSpg2()
    + load any speech waveform  (suggestions: misc/expansionist.wav, misc/friendly.wav)
    + use default spectrogram parameters
    
2. Find pitch and formants in the time and/or frequency domain
    + put the range cursor in the middle of a vowel
    + find pitch in three ways: time waveform, spectral slice, spectrogram: are your values consistent ?
    + could you determine gender from the obtained pitch values ?
    + find vowel identity by finding first and second formant and then looking them up in formant tables; which is your preferred view ?
    
3. Pitch and formants in the mel spectrum
    + add the mel spectrogram (and mel spectrum slice) to your view
    + find the formants in the mel spectrum both for low resolution (nb=20-30) and high resolution (>80) mel filterbanks
    + in which representation are the formants easiest to find ?
    + try to map formant frequencies to mel filterbank channels
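
For the last step, the mapping from a frequency in Hz to a mel filterbank channel can be sketched as follows. The assumptions are stated plainly: the mel formula `2595*log10(1 + f/700)` is one common variant and may differ from the one used by pyspch, filter centers are taken as equally spaced on the mel axis between 0 Hz and 8 kHz, and the formant values are illustrative numbers, not measurements from the example files. `mel_centers` and `nearest_channel` are hypothetical helpers.

```python
import numpy as np

def hz_to_mel(f_hz):
    # One common mel formula (assumption; other variants exist)
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_centers(nb, f_max=8000.0):
    """Center frequencies (Hz) of nb filters equally spaced on the mel axis."""
    edges_mel = np.linspace(0.0, hz_to_mel(f_max), nb + 2)
    return mel_to_hz(edges_mel[1:-1])

def nearest_channel(f_hz, nb, f_max=8000.0):
    """0-based index of the filter whose center frequency is closest to f_hz."""
    return int(np.argmin(np.abs(mel_centers(nb, f_max) - f_hz)))

formants = {"F1": 500.0, "F2": 1500.0}   # illustrative values only
for nb in (24, 80):
    centers = mel_centers(nb)
    for name, f in formants.items():
        ch = nearest_channel(f, nb)
        print(f"nb={nb:3d}  {name}={f:6.0f} Hz -> channel {ch:2d} "
              f"(center {centers[ch]:7.1f} Hz)")
```

With few channels, neighboring filters are far apart in Hz at higher frequencies, so the channel center can differ noticeably from the formant frequency you read off the Fourier spectrum.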

In [None]:
pyspch.interactive.Spg2(figwidth=12)     