# Lab 10: Keyword Recognition

## Part 1: Making a Digit Recognizer

In this section we will design a simple spoken digit recognizer, based on Dynamic Time Warping (DTW). In order to make such a system we need to first collect some data, and then design a DTW routine that can compare new inputs with templates for each digit.

Start by building the data set. Make a dozen or so recordings of yourself speaking each of the ten digits (0 to 9). We will use one recording of each digit as the template, and the rest as testing data. To avoid spending too much time collecting the data, record all these utterances in a single (long) sound file, then use your voice activity detector to split that file into the individual spoken digits.
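The splitting step can be sketched as follows. This is a minimal energy-based detector, not necessarily the one built in earlier labs; the function name `split_on_silence` and the `threshold` and `hangover` values are assumptions chosen for illustration and would need tuning on a real recording.

```python
import numpy as np

def split_on_silence(signal, frame_size=1024, threshold=1e-3, hangover=3):
    """Return (start, end) sample indices of the active segments in `signal`.

    A frame is active when its mean energy exceeds `threshold`. Each active
    frame also keeps the next `hangover` frames active, so quiet word endings
    are not cut off.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_size
    frames = signal[:n_frames * frame_size].reshape(n_frames, frame_size)
    active = (frames ** 2).mean(axis=1) > threshold

    # Extend activity forward by `hangover` frames
    extended = active.copy()
    for k in range(1, hangover + 1):
        extended[k:] |= active[:-k]

    # Turn runs of active frames into (start, end) sample ranges
    segments, start = [], None
    for i, is_active in enumerate(extended):
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            segments.append((start * frame_size, i * frame_size))
            start = None
    if start is not None:
        segments.append((start * frame_size, n_frames * frame_size))
    return segments

# Synthetic test signal: two tone bursts separated by silence
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(4096) / 8000)
signal = np.concatenate([np.zeros(4096), tone, np.zeros(8192), tone, np.zeros(8192)])
segments = split_on_silence(signal)
utterances = [signal[a:b] for a, b in segments]
print(segments)
```

Each entry of `utterances` then holds one spoken item, ready for feature extraction.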

In order to design a digit recognizer we will take a spoken input of a digit and compare it to each digit’s template. By finding which template is the most similar we can classify the input as belonging to that template’s digit. In order to measure the distance between the two sequences we have to use DTW on an appropriate feature space.

Decide which feature to use to represent your speech signals. It can be any feature that we used in the past (e.g. some type of STFT, MFCCs, etc.). When comparing a template with a new input you need to perform the following steps:

1. Compute the distance matrix between the feature frames of the template and those of the input. This will be an $M$ by $N$ matrix, where $M$ and $N$ are the number of frames in the template and in the input, in which the $(i, j)$ element represents the distance between the $i$-th frame of the template and the $j$-th frame of the input. We will use the cosine distance, which is defined as:

$$D(\mathbf{a},\mathbf{b}) = 1 - \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\sqrt{\sum_i b_i^2}}$$

2. Once you obtain the distance matrix, you need to compute the cost matrix, which encodes the cost of passing through a node given the optimal path leading to it. We will use the local constraint that node $(i, j)$ can be reached only from nodes $(i-1, j-1)$, $(i, j-1)$ or $(i-1, j)$.

3. Starting from the first element of the matrix, $(1, 1)$, perform the following steps for each element of the cost matrix. For node $(i, j)$, examine the nodes from which you can reach it (these are nodes $(i-1, j-1)$, $(i, j-1)$ and $(i-1, j)$) and see which one has the lowest cost. The cost of reaching node $(i, j)$ along the optimal path is then the cost of the optimal preceding node plus the distance at node $(i, j)$. Iterate until you have calculated the cost of passing through every node. As you do that, keep track, for each node, of which of the three preceding nodes was the optimal one.

4. Now you can backtrack and find the optimal path. Start from the final point of the cost matrix and find the node from which you arrived there (it will be the same one that had the lowest cost above). Once you get to that node, repeat this process until you reach the beginning indexes of the two sequences. The path that you took in this process will be the optimal path that aligns the two sequences.

5. The distance between the two sequences will be the cost of being at the final node. Use this to perform the digit classification.
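Steps 2 to 5 can be sketched as a single routine that takes a precomputed distance matrix. This is an illustrative sketch under the local constraint stated above, using 0-based indices; the function name `dtw` and the `prev` bookkeeping array are choices made here, not part of the assignment.

```python
import numpy as np

def dtw(distance):
    """Accumulate the DTW cost over a distance matrix and backtrack the optimal path.

    distance: (M, N) array of local frame distances.
    Returns (total_cost, path), where path is a list of 0-based (i, j) pairs
    running from (0, 0) to (M - 1, N - 1).
    """
    distance = np.asarray(distance, dtype=float)
    M, N = distance.shape
    cost = np.full((M, N), np.inf)
    prev = np.zeros((M, N, 2), dtype=int)  # optimal predecessor of each node
    steps = ((-1, -1), (0, -1), (-1, 0))   # diagonal, from the left, from above

    cost[0, 0] = distance[0, 0]
    for i in range(M):
        for j in range(N):
            if i == 0 and j == 0:
                continue
            best = np.inf
            for di, dj in steps:
                pi, pj = i + di, j + dj
                if pi >= 0 and pj >= 0 and cost[pi, pj] < best:
                    best = cost[pi, pj]
                    prev[i, j] = (pi, pj)
            # Cost of the best predecessor plus the distance at this node
            cost[i, j] = best + distance[i, j]

    # Backtrack from the final node to the start of both sequences
    path = [(M - 1, N - 1)]
    while path[-1] != (0, 0):
        i, j = path[-1]
        path.append((int(prev[i, j, 0]), int(prev[i, j, 1])))
    path.reverse()
    return cost[-1, -1], path

# Small worked example: a 3-frame template against a 2-frame input
example = np.array([[0.0, 1.0],
                    [0.0, 1.0],
                    [1.0, 0.0]])
total, path = dtw(example)
print(total, path)  # 0.0 [(0, 0), (1, 0), (2, 1)]
```

The returned `total` is the sequence distance used for classification; the path is only needed if you want to inspect the alignment.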

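Note: the cosine distance is one minus the cosine similarity, so two frames that point in the same direction have a distance of 0. This is the form used in the code below:
$$D(\mathbf{a}, \mathbf{b}) = 1 - \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\sqrt{\sum_i b_i^2}}$$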
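As a quick check of this formula, the full distance matrix can be computed in a vectorized way. The sketch below assumes feature matrices with one frame per column (the layout `librosa.feature.mfcc` returns); the function name `cosine_distance_matrix` and the `eps` guard against all-zero frames are additions made for illustration.

```python
import numpy as np

def cosine_distance_matrix(template, query, eps=1e-12):
    """Pairwise cosine distance between the frames (columns) of two feature matrices.

    template: (n_features, M) array, query: (n_features, N) array.
    Returns an (M, N) array whose (i, j) entry is the cosine distance between
    template frame i and query frame j.
    """
    template = np.asarray(template, dtype=float)
    query = np.asarray(query, dtype=float)
    dots = template.T @ query                  # (M, N) inner products
    t_norm = np.linalg.norm(template, axis=0)  # (M,) frame norms
    q_norm = np.linalg.norm(query, axis=0)     # (N,) frame norms
    # eps avoids division by zero when a frame is all zeros
    return 1.0 - dots / np.maximum(np.outer(t_norm, q_norm), eps)

# Two template frames and three query frames, each with 2 features
T = np.array([[1.0, 0.0],
              [0.0, 1.0]])
Q = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0]])
D = cosine_distance_matrix(T, Q)
print(np.round(D, 3))
```

Parallel frames give 0, orthogonal frames give 1, and the frame at 45 degrees gives $1 - 1/\sqrt{2} \approx 0.293$, independent of frame magnitude.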


In [1]:
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def sound( x, rate=8000, label=''):
    from IPython.display import display, Audio, HTML
    if label == '':
        display( Audio( x, rate=rate))
    else:
        display( HTML( 
        '<style> table, th, td {border: 0px; }</style> <table><tr><td>' + label + 
        '</td><td>' + Audio( x, rate=rate)._repr_html_()[3:] + '</td></tr></table>'
        ))

data, sr = librosa.load("record.m4a")
data = data[sr:]
sound( data, rate=sr, label="record")





In [41]:
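frame_size = 1024            # samples per analysis frame
energy_threshold = 0.000005  # peak squared amplitude that marks a frame as speech
segment_len = 17409          # target segment length: 1 seed sample + 17 frames of 1024

n_frames = len(data) // frame_size
detection = np.zeros(n_frames, dtype=int)
digit_sample = []

# Flag a frame and the three frames after it as active when its peak energy exceeds the threshold
for i in range(n_frames):
    if max(data[i * frame_size:(i + 1) * frame_size] ** 2) > energy_threshold:
        detection[i:i + 4] = 1  # scalar assignment also works when fewer than 4 frames remain

# Collect each run of consecutive active frames into one segment
active = False
for i in range(n_frames):
    start = i * frame_size
    if detection[i]:
        if not active:
            active = True
            digit_sample.append(np.zeros(1, dtype=float))
        digit_sample[-1] = np.append(digit_sample[-1], data[start:start + frame_size])
    elif active:
        active = False
        # Top up short segments to segment_len samples; longer ones are left unchanged
        missing = segment_len - len(digit_sample[-1])
        if missing > 0:
            digit_sample[-1] = np.append(digit_sample[-1], data[start:start + missing])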

# delete false positives 
digit_sample.pop(53)
digit_sample.pop(70)
digit_sample.pop(110)

# use the first ten segments as templates (assumed to be one per digit, in order 0-9)

template_mfccs = []
for i in range(0, 10):
    template_mfccs.append(librosa.feature.mfcc(y=digit_sample[i], sr=sr, n_mfcc=512))
    
# classify other inputs
correct_count = 0
for i in range(10, 120):
    input_mfcc = librosa.feature.mfcc(y=digit_sample[i], sr=sr, n_mfcc=512).T
    min_cost = float("inf")
    min_digit = None
    for j in range(0, 10):
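        # cosine distance between every input frame (rows) and every template frame (columns)
        distance = 1 - (np.matmul(input_mfcc, template_mfccs[j]) /
                        np.dot(np.sum(input_mfcc ** 2, axis=1)[:,None] ** 0.5, 
                               np.sum(template_mfccs[j] ** 2, axis=0)[None,:] ** 0.5))

        # accumulate the DTW cost: each node adds its distance to the cheapest allowed predecessor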
        cost = np.zeros(distance.shape, dtype=float)
        for m in range(len(distance)):
            for n in range(len(distance[0])):
                if m == n == 0:
                    cost[m][n] = distance[0][0]
                elif m == 0:
                    cost[m][n] = cost[m][n - 1] + distance[m][n]
                elif n == 0:
                    cost[m][n] = cost[m - 1][n] + distance[m][n]
                else:
                    cost[m][n] = min(cost[m][n - 1], cost[m - 1][n - 1], cost[m - 1][n]) + distance[m][n]
        #print(cost[-1][-1])
        if cost[-1][-1] < min_cost:
            min_cost = cost[-1][-1]
            min_digit = j
        #plt.imshow(cost)
        #plt.show()
    #print("test %d is classified as class %d" % (i, min_digit))
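    # the recordings are assumed to cycle through the digits 0-9, so segment i holds digit i % 10
    if i % 10 == min_digit:
        correct_count += 1
# fraction of the 110 test utterances classified correctly (strictly speaking, the accuracy)
print("precision:", correct_count / 110)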
    
        

precision: 1.0


## Part 2: Making a Voice-Driven Dialer

Suppose you just started working for a phone company and the first thing they ask you is to make a hands-free interface for their phones so that people can dial their friends by voice. During setup, the user speaks the name of a contact and then associates it with a number to call. Make a system that uses the full names of 4-5 of your friends, so that when you speak one of the names the system recognizes it (and thus could subsequently call the corresponding number).
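The enroll-then-look-up flow described above can be sketched independently of the audio front end. Everything in this sketch is hypothetical: the `Contact` and `VoiceDialer` names, the placeholder names and numbers, and `toy_distance`, which stands in for the DTW cost on real features.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence, Tuple

@dataclass
class Contact:
    template: Sequence[float]  # features of the enrolled utterance
    number: str                # number to dial for this contact

class VoiceDialer:
    """Maps spoken names to phone numbers by nearest-template matching."""

    def __init__(self, sequence_distance: Callable[[Sequence[float], Sequence[float]], float]):
        self.sequence_distance = sequence_distance
        self.contacts: Dict[str, Contact] = {}

    def enroll(self, name: str, template: Sequence[float], number: str) -> None:
        """Setup step: store the spoken-name template with its number."""
        self.contacts[name] = Contact(template, number)

    def lookup(self, query: Sequence[float]) -> Optional[Tuple[str, str]]:
        """Return (name, number) of the contact whose template is closest to the query."""
        if not self.contacts:
            return None
        name = min(self.contacts,
                   key=lambda n: self.sequence_distance(self.contacts[n].template, query))
        return name, self.contacts[name].number

def toy_distance(a, b):
    # Placeholder for the DTW cost: difference of the sequence means
    return abs(sum(a) / len(a) - sum(b) / len(b))

dialer = VoiceDialer(toy_distance)
dialer.enroll("Alice Example", [1.0, 2.0, 3.0], "555-0100")
dialer.enroll("Bob Example", [7.0, 8.0, 9.0], "555-0101")
result = dialer.lookup([6.5, 8.0, 9.5])
print(result)  # ('Bob Example', '555-0101')
```

Passing the distance function in as a parameter keeps the matching logic separate from the feature and DTW choices made in Part 1.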

In [48]:
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def recognizer(book, test):
    # features of the first 4 seconds of the input; sr is taken from the notebook's global scope
    input_mfcc = librosa.feature.mfcc(y=test[:sr*4], sr=sr, n_mfcc=512)
    min_cost = float("inf")
    nearest = None
    for k, v in book.items():
        distance = 1 - (np.matmul(input_mfcc.T, v[0]) / \
                        np.dot(np.sum(input_mfcc.T ** 2, axis=1)[:,None] ** 0.5, 
                               np.sum(v[0] ** 2, axis=0)[None,:] ** 0.5))
        cost = np.zeros(distance.shape, dtype=float)
        for m in range(len(distance)):
            for n in range(len(distance[0])):
                if m == n == 0:
                    cost[m][n] = distance[0][0]
                elif m == 0:
                    cost[m][n] = cost[m][n - 1] + distance[m][n]
                elif n == 0:
                    cost[m][n] = cost[m - 1][n] + distance[m][n]
                else:
                    cost[m][n] = min(cost[m][n - 1], cost[m - 1][n - 1], cost[m - 1][n]) + distance[m][n]
        if cost[-1][-1] < min_cost:
            min_cost = cost[-1][-1]
            nearest = v[1]
    return nearest

data, sr = librosa.load("templates.m4a")
sound(data[:sr*30], rate=sr, label='names')
# each name occupies its own 4-second window, spaced 5 seconds apart, in templates.m4a
book = {"Percy Owen": [librosa.feature.mfcc(y=data[:sr*4], sr=sr, n_mfcc=512), "409-213-5433"],
        "Floyd Millar": [librosa.feature.mfcc(y=data[sr*5:sr*9], sr=sr, n_mfcc=512), "534-332-1556"],
        "Zavier Pickett": [librosa.feature.mfcc(y=data[sr*10:sr*14], sr=sr, n_mfcc=512), "312-211-4478"],
        "Daniaal Robin": [librosa.feature.mfcc(y=data[sr*15:sr*19], sr=sr, n_mfcc=512), "294-223-8574"],
        "David Mckenna": [librosa.feature.mfcc(y=data[sr*20:sr*24], sr=sr, n_mfcc=512), "306-332-1866"]}

# My laptop environment does not support pysoundcard, so the test sound is prerecorded.
test, sr = librosa.load("test.m4a")
sound(test, rate=sr, label='input')

phone = recognizer(book, test[:sr*4])

print("dialing number \"%s\"" % phone)





dialing number "306-332-1866"
