# A COMPARISON OF CONVOLUTIONAL NEURAL NETWORKS FOR GLOTTAL CLOSURE INSTANT DETECTION FROM RAW SPEECH

This is an example of Python code that trains and tests an InceptionV3-1D model, a deep one-dimensional convolutional neural network (CNN), for detecting glottal closure instants (GCIs) in the speech signal. See the [corresponding paper](paper/matousek_ICASSP2021_paper.pdf) for more details.

[Keras](https://keras.io/) (v2.3.1) with the [TensorFlow](https://www.tensorflow.org/) (v1.15.3) backend is used to train and evaluate the CNN model.

Prerequisites are listed in the [requirements](requirements.txt) file.

First, we import the required modules.

In [1]:
import os
import os.path as osp
import numpy as np
import random as pyrandom
import tensorflow as tf
import sklearn.metrics as skm
import utils
from inception1D import InceptionV31D

Using TensorFlow backend.


## Data

Before showing the training and evaluation of the InceptionV3-1D model, we first describe the data. Note that only a [sample of data](data/sample) is used in this tutorial (40 waveforms for training and 2 waveforms for testing from 2 voice talents). In the [corresponding paper](paper/matousek_ICASSP2021_paper.pdf), 3200 waveforms from 16 voice talents were used.

The following sample of data is used:
* `spc8 ...` speech waveforms downsampled to 8 kHz
* `negpeaks ...` indices of negative peaks in the (filtered) speech waveform
* `targets ...` ground truth GCIs associated with the negative peaks (1=GCI, 0=non-GCI)

We used the [Multi-Phase Algorithm](http://www.sciencedirect.com/science/article/pii/S0167639311000094) (MPA) to detect GCIs from the contemporaneous electroglottograph (EGG) signal and took the detected GCIs as the ground truth.

As can be seen below, the numbers of GCIs and non-GCIs in our data are heavily unbalanced:

In [2]:
utt_list = np.loadtxt('data/sample/train.txt', 'str')
targets = np.hstack([np.load(osp.join('data/sample/targets', u+'.npy')) for u in utt_list])

print('# peaks:   ', len(targets))
print('# GCI:     ', len(targets[targets > 0]))
print('# non-GCI: ', len(targets[targets == 0]))

# peaks:    10990
# GCI:      8659
# non-GCI:  2331


This is caused by the 8 kHz sampling: at this rate, unvoiced segments contain fewer peaks, so fewer peaks are labelled as non-GCIs.
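
To put the imbalance in numbers, the following minimal sketch computes the class proportions from the counts printed above, together with the accuracy a trivial classifier would reach by labelling every peak as a GCI. The inverse-frequency class weights are a common heuristic shown purely as an assumption; neither the notebook nor the paper is claimed to use them.

```python
import numpy as np

# Counts taken from the output above
n_gci, n_non_gci = 8659, 2331
counts = np.array([n_non_gci, n_gci])  # index 0 = non-GCI, index 1 = GCI
n_total = counts.sum()

# Share of each class among all peaks
proportions = counts / n_total

# Accuracy of a trivial classifier that labels every peak as a GCI
baseline_accuracy = proportions[1]

# Hypothetical inverse-frequency class weights (assumption, a common heuristic)
class_weight = {c: n_total / (2 * counts[c]) for c in range(2)}

print('GCI share:         {:.2%}'.format(proportions[1]))
print('non-GCI share:     {:.2%}'.format(proportions[0]))
print('Baseline accuracy: {:.2%}'.format(baseline_accuracy))
print('Class weights:     ', class_weight)
```

Any accuracy figure reported for this data should be read against that baseline.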

## Training and evaluating the CNN model

The following code fixes the random seeds and tries to ensure reproducibility:

In [3]:
seed_value = 7
# Set `PYTHONHASHSEED` environment variable at a fixed value
os.environ['PYTHONHASHSEED'] = str(seed_value)
os.environ['CUDA_VISIBLE_DEVICES'] = ''
# Set python built-in pseudo-random generator at a fixed value
pyrandom.seed(seed_value)
# Set numpy pseudo-random generator at a fixed value
np.random.seed(seed_value)
# Set the tensorflow pseudo-random generator at a fixed value
tf.set_random_seed(seed_value)

# Configure a new global `tensorflow` session
from keras import backend as K
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
K.set_session(sess)

Then, we read the train/validation data

In [4]:
X_trn, y_trn, input_shape = utils.load_data('data/sample/train.txt', 'data/sample/spc8', 'data/sample/negpeaks',
                                            'data/sample/targets', frame_length=0.03, winfunc=None)

and check the shape of the input data:

In [5]:
print('Input shape: ', input_shape)
print('# of training examples:', X_trn.shape[0])
print('# of samples per frame:', X_trn.shape[1])

Input shape:  (240, 1)
# of training examples: 10990
# of samples per frame: 240
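
The 240 samples per frame follow from the 0.03 s frame length at the 8 kHz sampling rate (0.03 × 8000 = 240). The sketch below shows one way such frames could be cut around peak indices. `extract_frames` is a hypothetical helper, and centring each frame on its peak with zero padding at the signal edges is an assumption, not necessarily what `utils.load_data` does.

```python
import numpy as np

def extract_frames(signal, peaks, fs=8000, frame_length=0.03):
    """Cut one fixed-length frame centred on each peak (hypothetical helper)."""
    n = int(round(frame_length * fs))     # samples per frame: 0.03 s * 8000 Hz = 240
    half = n // 2
    # Zero-pad both ends so frames near the signal edges keep the full length
    padded = np.pad(signal, (half, n - half), mode='constant')
    # After padding, original index p sits at p + half, so the frame starts at p
    frames = np.stack([padded[p:p + n] for p in peaks])
    # Add a trailing channel axis to match the (240, 1) input shape
    return frames[..., np.newaxis]

# Toy usage with a synthetic signal and made-up peak indices
signal = np.sin(2 * np.pi * 100 * np.arange(8000) / 8000)
peaks = np.array([0, 500, 7999])
X = extract_frames(signal, peaks)
print(X.shape)
```

Each row of `X` is then one training example for the 1D CNN, with the peak sample located in the middle of the frame.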


In [6]:
X_val, y_val, input_shape = utils.load_data('data/sample/val.txt', 'data/sample/spc8', 'data/sample/negpeaks',
                                            'data/sample/targets', frame_length=0.03, winfunc=None)

In [7]:
print('# of validation examples:', X_val.shape[0])

# of validation examples: 457


In this example, we use a 1D version of the InceptionV3 model, which the paper shows to achieve the best results on the test set. The model is defined and compiled as follows:

In [None]:
# Model definition
model = InceptionV31D(input_shape)
# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Then, we can train the model on the training set and evaluate it on the validation set:

In [9]:
history = model.fit(X_trn, y_trn, validation_data=(X_val, y_val), epochs=2, batch_size=128, verbose=1)


Train on 10990 samples, validate on 457 samples
Epoch 1/2
Epoch 2/2


In this very simplified example, the accuracy on the validation set was about 83%. Much better results can be obtained by using more training data from more voice talents, by tuning the hyper-parameters (such as the frame size, batch size, and learning rate), and by training the model for more epochs. Please see the [paper](paper/Matousek_ICASSP2021_paper.pdf) for more details.

Since the data is unbalanced, the _accuracy_ score can be misleading. In the [paper](paper/Matousek_ICASSP2021_paper.pdf), we use the $F1$, _recall_ ($R$), and _precision_ ($P$) scores. For this purpose, we first predict whether each peak is a GCI or a non-GCI

In [10]:
# Predict to get some other metrics
y_proba = model.predict(X_val, verbose=1)[:, 0]
y_pred = utils.proba2classes(y_proba)



and then we use [Scikit-learn](http://scikit-learn.org/stable/) tools to calculate the measures:

In [13]:
print('F1 = {:.2%}'.format(skm.f1_score(y_val, y_pred)))
print('R  = {:.2%}'.format(skm.recall_score(y_val, y_pred)))
print('P  = {:.2%}'.format(skm.precision_score(y_val, y_pred)))

F1 = 90.52%
R  = 100.00%
P  = 82.68%
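
For reference, the three measures can also be computed directly from the confusion counts. The following sketch does so with NumPy on made-up labels and probabilities. The 0.5 decision threshold is an assumption about what `utils.proba2classes` does, and `prf_scores` is a hypothetical helper, not part of the repository.

```python
import numpy as np

def prf_scores(y_true, y_pred):
    """Precision, recall and F1 for the positive (GCI) class (hypothetical helper)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)     # GCIs detected as GCIs
    fp = np.sum(~y_true & y_pred)    # non-GCIs detected as GCIs
    fn = np.sum(y_true & ~y_pred)    # missed GCIs
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up example: 4 GCIs and 2 non-GCIs
y_true = np.array([1, 1, 1, 1, 0, 0])
y_proba = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.2])
y_pred = (y_proba >= 0.5).astype(int)   # assumed threshold of 0.5

P, R, F1 = prf_scores(y_true, y_pred)
print('F1 = {:.2%}'.format(F1))
print('R  = {:.2%}'.format(R))
print('P  = {:.2%}'.format(P))
```

When every GCI is found but some non-GCIs are also accepted, recall is 100% while precision drops, which is the pattern seen in the output above.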
