## Importing the necessary packages.
Make sure you have all of them installed.

In [1]:
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import plotly.plotly as py  # plotly < 4 only; from plotly 4 on this module lives in chart_studio.plotly
import plotly.graph_objs as go

## Function used to load the audio files.
It loads an audio file together with its time-label file, which is used to cut the individual words out of the whole recording. The input argument is the file name without its extension. The function returns two lists: **words** contains one array of samples per recorded word, and **labels** contains the corresponding words as strings. Each word is scaled so that its largest absolute sample equals 1. After calling this function we have the cut-out words without the noise recorded between them.

In [2]:
def Signal_load(file_name):
    signal, sr = librosa.load('Recordings/{}.wav'.format(file_name), sr = 44100)
    time_labels = np.loadtxt('Recordings/{}.txt'.format(file_name), dtype='str', delimiter='\t')
    words = []
    labels = []
    for i in range(0, len(time_labels)):
        start = int(float(time_labels[i][0]) * sr)
        end = int(float(time_labels[i][1]) * sr)
        word = time_labels[i][2]
        labels.append(word)
        word_signal = signal[start:end]
        # Peak-normalise the word once so its largest absolute sample is 1.
        word_signal = word_signal / np.amax(np.absolute(word_signal))
        words.append(word_signal)
    return words, labels
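The cutting and normalising steps can be tried without any audio files. The following is a minimal sketch, not the notebook's loader: `cut_words` is a hypothetical helper, the signal is a synthetic tone, and the rows stand in for the three tab-separated columns of a label file. It also assumes that a silent or empty segment should be left unscaled rather than divided by zero.

```python
import numpy as np

def cut_words(signal, sr, rows):
    """Cut labelled segments out of `signal` and peak-normalise each one.

    `rows` holds (start_seconds, end_seconds, word) triples, mirroring the
    three columns read from the label file.
    """
    words, labels = [], []
    for start_s, end_s, word in rows:
        start = int(float(start_s) * sr)   # seconds -> sample index
        end = int(float(end_s) * sr)
        segment = signal[start:end]
        peak = np.max(np.abs(segment)) if segment.size else 0.0
        if peak > 0:                       # leave silent or empty cuts unscaled
            segment = segment / peak
        words.append(segment)
        labels.append(word)
    return words, labels

# Synthetic stand-in for a recording: 1 s of a 440 Hz tone sampled at 8 kHz.
sr = 8000
t = np.arange(sr) / sr
signal = 0.3 * np.sin(2 * np.pi * 440 * t)
rows = [("0.125", "0.375", "OTWORZ"), ("0.5", "0.75", "ZAMKNIJ")]

words, labels = cut_words(signal, sr, rows)
print(labels, [len(w) for w in words])
```

Guarding the division matters only for segments that contain no signal at all; for ordinary speech the result is the same as in `Signal_load`.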

## Function used to extract features from each of the words.
It uses the mfcc function from the librosa package, which returns the mel-frequency cepstral coefficients of a recording passed as a NumPy array. The coefficients are averaged over time, so the function takes a NumPy array as input and returns a NumPy array of shape (1, 40): a single row holding the 40 mean coefficients.

In [3]:
def Extract_features(signal):
    mfccs = np.mean(librosa.feature.mfcc(y = signal, sr = 44100, n_mfcc = 40).T, axis = 0)
    mfccs = np.array([mfccs])
    return mfccs
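To make the shapes concrete, here is a small sketch of the averaging step only. It does not call librosa: a random matrix stands in for the (n_mfcc, n_frames) array that `librosa.feature.mfcc` returns, and the frame count of 120 is an arbitrary assumption.

```python
import numpy as np

n_mfcc, n_frames = 40, 120
# Random stand-in for the (n_mfcc, n_frames) matrix of coefficients.
rng = np.random.default_rng(0)
mfcc_matrix = rng.normal(size=(n_mfcc, n_frames))

# Averaging over the frame axis gives one value per coefficient. This is
# equivalent to np.mean(mfcc_matrix.T, axis=0) as written in Extract_features.
mean_per_coefficient = mfcc_matrix.mean(axis=1)

# A leading axis turns the vector into a single row that can be appended
# to an (n_words, 40) feature matrix.
feature_row = mean_per_coefficient[np.newaxis, :]
print(mfcc_matrix.shape, mean_per_coefficient.shape, feature_row.shape)
```

Because the mean is taken over time, words of different lengths all end up as rows of the same width, which is what the classifier needs.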

## First example.
In this example all of the recordings come from the same person. Three of them are used as the training data set and one as the testing data set. First, all of the recordings are loaded.

In [4]:
train_words_1, train_labels_1 = Signal_load('266753_23_M_19_1')
train_words_2, train_labels_2 = Signal_load('266753_23_M_20_2')
train_words_3, train_labels_3 = Signal_load('266753_23_M_20_3')
train_words = train_words_1 + train_words_2 + train_words_3
train_labels = train_labels_1 + train_labels_2 + train_labels_3
test_words, test_labels = Signal_load('266753_23_M_21_4')

After that we extract the features.

In [5]:
train_features = np.empty((0, 40))
for i in range(0, len(train_words)):
    feature = Extract_features(train_words[i])
    train_features = np.append(train_features, feature, axis = 0)
 
test_features = np.empty((0, 40))
for i in range(0, len(test_words)):
    feature = Extract_features(test_words[i])
    test_features = np.append(test_features, feature, axis = 0)
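The same matrix can be built in one step by collecting the rows in a list and stacking them once. This is only a sketch of that alternative: `build_feature_matrix` and `toy_extract` are hypothetical names, and the toy extractor returns two statistics per word instead of 40 MFCC means.

```python
import numpy as np

def build_feature_matrix(words, extract):
    """Stack one (1, n_features) row per word into an (n_words, n_features) array."""
    rows = [extract(word) for word in words]
    return np.vstack(rows)   # note: raises ValueError if `words` is empty

def toy_extract(signal):
    # Hypothetical stand-in for Extract_features: mean and standard deviation.
    return np.array([[signal.mean(), signal.std()]])

toy_words = [np.array([0.0, 1.0, 0.0]), np.array([1.0, 1.0]), np.array([-1.0, 1.0])]
features = build_feature_matrix(toy_words, toy_extract)
print(features.shape)
```

In the notebook this would be called as `build_feature_matrix(train_words, Extract_features)`. Stacking once avoids copying the growing array on every `np.append` call, which starts to matter when many recordings are loaded.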

Now we create an SVC object called clf (our classifier). The fit function trains the classifier on train_features, an array in which each row contains the features of a single word, and train_labels, a list in which every element is the name of the corresponding recorded word, so it is effectively the list of class names. Once the classifier has learned from the training data set, we call predict, which uses the trained model to predict the class names (in this case words) from the features it is given.

In [6]:
clf = SVC(kernel = "linear", C = 0.025)
clf.fit(train_features, train_labels)
preds = clf.predict(test_features)
print('Accuracy: {}'.format(accuracy_score(test_labels, preds)))

Accuracy: 0.7692307692307693


The result is fairly satisfying, considering that the training data set is not very large. Below you can see the expected results and the ones we got.

In [7]:
print('Expected words: \n{}'.format(test_labels))
print('\nPredicted words: \n{}'.format(preds))

Expected words: 
['OTWORZ', 'ZAMKNIJ', 'GARAZ', 'ZROB', 'NASTROJ', 'WLACZ', 'WYLACZ', 'MUZYKE', 'SWIATLO', 'ZAPAL', 'PODNIES', 'ROLETY', 'TELEWIZOR']

Predicted words: 
['OTWORZ' 'ZAMKNIJ' 'GARAZ' 'ZROB' 'NASTROJ' 'TELEWIZOR' 'TELEWIZOR'
 'MUZYKE' 'SWIATLO' 'ZAPAL' 'PODNIES' 'TELEWIZOR' 'TELEWIZOR']
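The accuracy reported above can be checked by hand from these two lists. The sketch below copies them from the output and counts the matching positions, which is the fraction that `accuracy_score` reports for this kind of labelling; it also lists the (expected, predicted) pairs that differ.

```python
expected = ['OTWORZ', 'ZAMKNIJ', 'GARAZ', 'ZROB', 'NASTROJ', 'WLACZ', 'WYLACZ',
            'MUZYKE', 'SWIATLO', 'ZAPAL', 'PODNIES', 'ROLETY', 'TELEWIZOR']
predicted = ['OTWORZ', 'ZAMKNIJ', 'GARAZ', 'ZROB', 'NASTROJ', 'TELEWIZOR', 'TELEWIZOR',
             'MUZYKE', 'SWIATLO', 'ZAPAL', 'PODNIES', 'TELEWIZOR', 'TELEWIZOR']

# Count the positions where the prediction matches the expected word.
correct = sum(e == p for e, p in zip(expected, predicted))
accuracy = correct / len(expected)

# Collect the (expected, predicted) pairs that do not match.
misclassified = [(e, p) for e, p in zip(expected, predicted) if e != p]

print(correct, len(expected), accuracy)
print(misclassified)
```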


To see it more clearly we can present the results on a heat map.

In [8]:
heat_map = np.zeros((len(test_labels), len(preds)))
for i in range(0, len(test_labels)):
    for j in range(0, len(preds)):
        if test_labels[i] == preds[j]:
            heat_map[i][j] = 1
heat_map = (heat_map / 1) * 100            

trace = go.Heatmap(z = heat_map,
                   x = test_labels,
                   y = test_labels,
                   colorscale='Viridis')

data=[trace]
py.iplot(data)

On the X axis are the words given as the testing data set, so they are the expected words. On the Y axis are the predicted values, that is, the words returned by our classifier. Colours show the percentage of words assigned to each class (here, each word). You can see the exact percentage by moving the cursor over the part of the graph you're interested in. The perfect outcome would be a yellow diagonal line, meaning that each word was classified correctly. The results are not bad: only 3 out of 13 words were classified wrongly. Interestingly, all of the wrongly classified words were assigned to the same class (TELEWIZOR).
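The same information is often kept as a table of counts indexed by class name rather than by sample position. The following sketch shows that variant on three of the words; `confusion_counts` is a hypothetical helper, and it uses the opposite orientation to the heat map above (rows are expected words, columns are predicted words), so it would have to be transposed to match the plot.

```python
import numpy as np

def confusion_counts(expected, predicted, classes):
    """Count how often each expected word (row) was predicted as each word (column)."""
    index = {name: i for i, name in enumerate(classes)}
    counts = np.zeros((len(classes), len(classes)), dtype=int)
    for e, p in zip(expected, predicted):
        counts[index[e], index[p]] += 1
    return counts

# Toy subset of the words: everything is predicted as TELEWIZOR.
classes = ['WLACZ', 'WYLACZ', 'TELEWIZOR']
expected = ['WLACZ', 'WYLACZ', 'TELEWIZOR']
predicted = ['TELEWIZOR', 'TELEWIZOR', 'TELEWIZOR']

counts = confusion_counts(expected, predicted, classes)
print(counts)
```

A column that collects counts from several rows, like the TELEWIZOR column here, is the table's version of the pattern described above, where several words are pulled into one class.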

## Second example.
This time we use two recordings from the same person as the training data set and the two remaining recordings from that person as the testing data set.

In [9]:
train_words_1, train_labels_1 = Signal_load('266753_23_M_19_1')
train_words_2, train_labels_2 = Signal_load('266753_23_M_20_2')
train_words = train_words_1 + train_words_2
train_labels = train_labels_1 + train_labels_2
test_words_1, test_labels_1 = Signal_load('266753_23_M_20_3')
test_words_2, test_labels_2 = Signal_load('266753_23_M_21_4')
test_words = test_words_1 + test_words_2
test_labels = test_labels_1 + test_labels_2

train_features = np.empty((0, 40))
for i in range(0, len(train_words)):
    feature = Extract_features(train_words[i])
    train_features = np.append(train_features, feature, axis = 0)
 
test_features = np.empty((0, 40))
for i in range(0, len(test_words)):
    feature = Extract_features(test_words[i])
    test_features = np.append(test_features, feature, axis = 0)
    
clf = SVC(kernel = "linear", C = 0.025)
clf.fit(train_features, train_labels)
preds = clf.predict(test_features)
print('Accuracy: {}'.format(accuracy_score(test_labels, preds)))

heat_map = np.zeros((len(test_labels_1), len(test_labels_1)))
for i in range(0, len(test_labels_1)):
    for j in range(0, len(test_labels_1)):
        if test_labels[i] == preds[j]:
            heat_map[i][j] += 1
        if test_labels[i + len(test_labels_1)] == preds[j + len(test_labels_1)]:
            heat_map[i][j] += 1
heat_map = (heat_map / 2) * 100 
            
trace = go.Heatmap(z = heat_map,
                   x = test_labels_1,
                   y = test_labels_1,
                   colorscale='Viridis')

data=[trace]
py.iplot(data)

Accuracy: 0.6153846153846154


The results are not bad (61.5%). On the heat map we can see that some of the words were classified correctly both times (OTWORZ, GARAZ, MUZYKE, ZAPAL) and one was classified wrongly both times (ZROB). A cell is yellow if the word was classified as the same word both times and green if it was classified as that word once; for example, NASTROJ was classified once as NASTROJ and once as MUZYKE.
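The accumulation over two recordings generalises to any number of test recordings. Below is a sketch under the same assumption the loops above make, namely that every recording contains the same words in the same order; `aggregate_heat_map` is a hypothetical helper and the letters are toy labels.

```python
import numpy as np

def aggregate_heat_map(labels, preds, n_words):
    """Percentage of recordings in which sample j was predicted as word i."""
    labels = np.asarray(labels)
    preds = np.asarray(preds)
    n_recordings = len(labels) // n_words
    heat_map = np.zeros((n_words, n_words))
    for r in range(n_recordings):
        block_labels = labels[r * n_words:(r + 1) * n_words]
        block_preds = preds[r * n_words:(r + 1) * n_words]
        # Cell [i, j] grows by one when sample j of this recording
        # was predicted as word i (same test as in the nested loops).
        heat_map += block_labels[:, None] == block_preds[None, :]
    return heat_map / n_recordings * 100

# Two toy recordings of three words each.
labels = ['A', 'B', 'C', 'A', 'B', 'C']
preds = ['A', 'B', 'A', 'A', 'C', 'C']
heat_map = aggregate_heat_map(labels, preds, n_words=3)
print(heat_map)
```

Each column sums to 100 as long as every prediction is one of the words in the recording, which is a quick sanity check on the matrix before plotting it.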

## Third example.
In this example we use the same audio files, but now all four of them are used as the training data set and one of them is also used as the testing data set.

In [10]:
train_words_1, train_labels_1 = Signal_load('266753_23_M_19_1')
train_words_2, train_labels_2 = Signal_load('266753_23_M_20_2')
train_words_3, train_labels_3 = Signal_load('266753_23_M_20_3')
train_words_4, train_labels_4 = Signal_load('266753_23_M_21_4')
train_words = train_words_1 + train_words_2 + train_words_3 + train_words_4
train_labels = train_labels_1 + train_labels_2 + train_labels_3 + train_labels_4
test_words, test_labels = Signal_load('266753_23_M_21_4')

train_features = np.empty((0, 40))
for i in range(0, len(train_words)):
    feature = Extract_features(train_words[i])
    train_features = np.append(train_features, feature, axis = 0)
 
test_features = np.empty((0, 40))
for i in range(0, len(test_words)):
    feature = Extract_features(test_words[i])
    test_features = np.append(test_features, feature, axis = 0)
    
clf = SVC(kernel = "linear", C = 0.025)
clf.fit(train_features, train_labels)
preds = clf.predict(test_features)
print('Accuracy: {}'.format(accuracy_score(test_labels, preds)))

heat_map = np.zeros((len(test_labels), len(preds)))
for i in range(0, len(test_labels)):
    for j in range(0, len(preds)):
        if test_labels[i] == preds[j]:
            heat_map[i][j] = 1
heat_map = (heat_map / 1) * 100 

trace = go.Heatmap(z = heat_map,
                   x = test_labels,
                   y = test_labels,
                   colorscale='Viridis')

data=[trace]
py.iplot(data)

Accuracy: 1.0


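As expected, the results are better: accuracy is 100% and the heat map shows the ideal outcome mentioned earlier. Note that the test recording is also part of the training data set, so this figure shows how well the classifier reproduces data it has already seen rather than how well it handles new recordings.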
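In this example the test recording ('266753_23_M_21_4') is also one of the training recordings. A small check like the one sketched below makes such overlap visible before the accuracy is interpreted; `shared_recordings` is a hypothetical helper and is not part of the notebook.

```python
def shared_recordings(train_files, test_files):
    """Return the recordings that appear in both lists, in training order."""
    test_set = set(test_files)
    return [name for name in train_files if name in test_set]

train_files = ['266753_23_M_19_1', '266753_23_M_20_2',
               '266753_23_M_20_3', '266753_23_M_21_4']
test_files = ['266753_23_M_21_4']

overlap = shared_recordings(train_files, test_files)
if overlap:
    print('Recordings used for both training and testing:', overlap)
```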

## Fourth example.
Now the training data set again consists of 4 recordings from the same person, but the testing data set is one recording from a different person.

In [11]:
train_words_1, train_labels_1 = Signal_load('266753_23_M_19_1')
train_words_2, train_labels_2 = Signal_load('266753_23_M_20_2')
train_words_3, train_labels_3 = Signal_load('266753_23_M_20_3')
train_words_4, train_labels_4 = Signal_load('266753_23_M_21_4')
train_words = train_words_1 + train_words_2 + train_words_3 + train_words_4
train_labels = train_labels_1 + train_labels_2 + train_labels_3 + train_labels_4
test_words, test_labels = Signal_load('266701_23_M_11_1')

train_features = np.empty((0, 40))
for i in range(0, len(train_words)):
    feature = Extract_features(train_words[i])
    train_features = np.append(train_features, feature, axis = 0)
 
test_features = np.empty((0, 40))
for i in range(0, len(test_words)):
    feature = Extract_features(test_words[i])
    test_features = np.append(test_features, feature, axis = 0)
    
clf = SVC(kernel = "linear", C = 0.025)
clf.fit(train_features, train_labels)
preds = clf.predict(test_features)
print('Accuracy: {}'.format(accuracy_score(test_labels, preds)))

heat_map = np.zeros((len(test_labels), len(preds)))
for i in range(0, len(test_labels)):
    for j in range(0, len(preds)):
        if test_labels[i] == preds[j]:
            heat_map[i][j] = 1
heat_map = (heat_map / 1) * 100 

trace = go.Heatmap(z = heat_map,
                   x = test_labels,
                   y = test_labels,
                   colorscale='Viridis')

data=[trace]
py.iplot(data)

Accuracy: 0.23076923076923078


The results are not very good: only 3 out of 13 words were classified correctly. This suggests that, to use this code in a smart home system, the classifier would have to be trained by the person who will use it, although I think such per-user configuration is common in speech recognition systems.

## Fifth example.
In this example we use all of the available recordings (54). There were more recordings, but the files were corrupted. The recordings come from 14 people: 13 of them recorded 4 files and 1 recorded 2. The files were divided into two equal data sets, training and testing. Each set contains exactly half of the recordings from each person; for example, 2 of my recordings are in the training data set and the other 2 in the testing data set, and likewise for every other person.

In [12]:
train_words = []
train_labels = []
for i in range(1, 28):
    word, label = Signal_load(str(i))
    train_words = train_words + word
    train_labels = train_labels + label
    
train_labels = np.asarray(train_labels)

test_words = []
test_labels = []
for i in range(28, 55):
    word, label = Signal_load(str(i))
    test_words = test_words + word
    test_labels = test_labels + label

test_labels = np.asarray(test_labels)

train_features = np.empty((0, 40))
for i in range(0, len(train_words)):
    feature = Extract_features(train_words[i])
    train_features = np.append(train_features, feature, axis = 0)
    
test_features = np.empty((0, 40))
for i in range(0, len(test_words)):
    feature = Extract_features(test_words[i])
    test_features = np.append(test_features, feature, axis = 0)
    
clf = SVC(kernel = "linear", C = 0.025)
clf.fit(train_features, train_labels)
preds = clf.predict(test_features)
print('Accuracy: {}'.format(accuracy_score(test_labels, preds)))

heat_map = np.zeros((13, 13))
for i in range(0, 27):
    for j in range(0, 13):
        for k in range(0, 13):
            if test_labels[j + (13 * i)] == preds[k + (13 * i)]:
                heat_map[j][k] += 1
heat_map = (heat_map / 27) * 100

# The matrix is 13 x 13, so label the axes with the 13 words of the first recording.
trace = go.Heatmap(z = heat_map,
                   x = test_labels[:13],
                   y = test_labels[:13],
                   colorscale='Viridis')

data=[trace]
py.iplot(data)

Accuracy: 0.6153846153846154


Again the results are quite satisfying (61.5%). ROLETY was classified correctly 24 out of 27 times (89%) and GARAZ 22 times (81%). The worst results come from ZAMKNIJ, WLACZ and WYLACZ (11 out of 27 times, 41%). WLACZ was wrongly classified as SWIATLO 6 times (22%).
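Per-word figures like these can also be computed directly from the label and prediction lists instead of being read off the heat map. The sketch below uses toy data, not the notebook's results, and `per_word_accuracy` is a hypothetical helper.

```python
from collections import Counter

def per_word_accuracy(expected, predicted):
    """Fraction of occurrences of each expected word that were predicted correctly."""
    totals = Counter(expected)
    hits = Counter(e for e, p in zip(expected, predicted) if e == p)
    return {word: hits[word] / totals[word] for word in totals}

# Toy data: ROLETY is always right, WLACZ is confused with SWIATLO once.
expected = ['ROLETY', 'ROLETY', 'WLACZ', 'WLACZ']
predicted = ['ROLETY', 'ROLETY', 'SWIATLO', 'WLACZ']

scores = per_word_accuracy(expected, predicted)
# Count the wrong (expected, predicted) pairs to find the most frequent confusion.
confusions = Counter((e, p) for e, p in zip(expected, predicted) if e != p)

print(scores)
print(confusions.most_common(1))
```

With the notebook's variables this would be called as `per_word_accuracy(test_labels, preds)`.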

## Sixth example.
This time 41 recordings were used as the training data set and the remaining 13 as the testing data set.

In [13]:
train_words = []
train_labels = []
for i in range(1, 42):
    word, label = Signal_load(str(i))
    train_words = train_words + word
    train_labels = train_labels + label
    
train_labels = np.asarray(train_labels)

test_words = []
test_labels = []
for i in range(42, 55):
    word, label = Signal_load(str(i))
    test_words = test_words + word
    test_labels = test_labels + label

test_labels = np.asarray(test_labels)

train_features = np.empty((0, 40))
for i in range(0, len(train_words)):
    feature = Extract_features(train_words[i])
    train_features = np.append(train_features, feature, axis = 0)
    
test_features = np.empty((0, 40))
for i in range(0, len(test_words)):
    feature = Extract_features(test_words[i])
    test_features = np.append(test_features, feature, axis = 0)
    
clf = SVC(kernel = "linear", C = 0.025)
clf.fit(train_features, train_labels)
preds = clf.predict(test_features)
print('Accuracy: {}'.format(accuracy_score(test_labels, preds)))

heat_map = np.zeros((13, 13))
for i in range(0, 13):
    for j in range(0, 13):
        for k in range(0, 13):
            if test_labels[j + (13 * i)] == preds[k + (13 * i)]:
                heat_map[j][k] += 1
heat_map = (heat_map / 13) * 100 

# The matrix is 13 x 13, so label the axes with the 13 words of the first recording.
trace = go.Heatmap(z = heat_map,
                   x = test_labels[:13],
                   y = test_labels[:13],
                   colorscale='Viridis')

data=[trace]
py.iplot(data)

Accuracy: 0.6153846153846154


As in the previous example, we get 61.5% accuracy. GARAZ and ROLETY again gave the best results, but this time PODNIES and TELEWIZOR match them (each 10 out of 13 times, 77%). WLACZ again gave the worst result (3 out of 13 times, 23%), followed by WYLACZ (4 out of 13 times, 31%). ZAMKNIJ improved a bit (7 out of 13 times, 54%, previously 41%).
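The examples that use the numbered files differ mainly in where the list of recordings is cut. A sketch of that split as a function is shown below; `split_recordings` is a hypothetical helper and it assumes, as the loops above do, that the files are named '1' to '54' and that the first files form the training set.

```python
def split_recordings(n_recordings, n_train):
    """Return (train_ids, test_ids) as file-name strings, cut after n_train files."""
    ids = [str(i) for i in range(1, n_recordings + 1)]
    return ids[:n_train], ids[n_train:]

# Training set sizes tried with the 54 numbered recordings.
for n_train in (27, 41, 53):
    train_ids, test_ids = split_recordings(54, n_train)
    print(n_train, len(test_ids), train_ids[-1], test_ids[0])
```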

## Seventh example.
In this last example we use 53 out of the 54 recordings as the training data set and the one remaining recording as the testing data set.

In [14]:
train_words = []
train_labels = []
for i in range(1, 54):
    word, label = Signal_load(str(i))
    train_words = train_words + word
    train_labels = train_labels + label
    
train_labels = np.asarray(train_labels)

test_words, test_labels = Signal_load('54')

test_labels = np.asarray(test_labels)

train_features = np.empty((0, 40))
for i in range(0, len(train_words)):
    feature = Extract_features(train_words[i])
    train_features = np.append(train_features, feature, axis = 0)
    
test_features = np.empty((0, 40))
for i in range(0, len(test_words)):
    feature = Extract_features(test_words[i])
    test_features = np.append(test_features, feature, axis = 0)
    
clf = SVC(kernel = "linear", C = 0.025)
clf.fit(train_features, train_labels)
preds = clf.predict(test_features)
print('Accuracy: {}'.format(accuracy_score(test_labels, preds)))

heat_map = np.zeros((len(test_labels), len(preds)))
for i in range(0, len(test_labels)):
    for j in range(0, len(preds)):
        if test_labels[i] == preds[j]:
            heat_map[i][j] = 1
heat_map = (heat_map / 1) * 100 

trace = go.Heatmap(z = heat_map,
                   x = test_labels,
                   y = test_labels,
                   colorscale='Viridis')

data=[trace]
py.iplot(data)

Accuracy: 0.6153846153846154


For the third time in a row we get 61.5% accuracy. As before, ZAMKNIJ, WLACZ and WYLACZ were classified wrongly, but this time ZROB and MUZYKE were also misclassified. Across all of the heat maps a pattern emerges: some words are often classified incorrectly (ZAMKNIJ, WLACZ, WYLACZ) and some are usually classified correctly (OTWORZ, ROLETY, TELEWIZOR).
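All seven examples repeat the same load, extract, fit and predict steps, so they could share one helper. What follows is a sketch of such a helper, not the code that produced the results above. `run_experiment` is a hypothetical name, and the loader, extractor and `NearestMean` classifier are toy stand-ins so that the block runs without audio files or scikit-learn.

```python
import numpy as np

def run_experiment(train_files, test_files, load, extract, classifier):
    """Load, extract features, fit and score; returns (accuracy, predictions)."""
    def features_and_labels(files):
        rows, labels = [], []
        for name in files:
            words, word_labels = load(name)
            rows.extend(extract(word) for word in words)
            labels.extend(word_labels)
        return np.vstack(rows), np.asarray(labels)

    x_train, y_train = features_and_labels(train_files)
    x_test, y_test = features_and_labels(test_files)
    classifier.fit(x_train, y_train)
    preds = np.asarray(classifier.predict(x_test))
    return float(np.mean(preds == y_test)), preds

class NearestMean:
    """Tiny stand-in for SVC: predicts the class with the closest mean feature row."""
    def fit(self, x, y):
        self.classes_ = sorted(set(y))
        self.means_ = np.array([x[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, x):
        distances = np.linalg.norm(x[:, None, :] - self.means_[None, :, :], axis=2)
        return [self.classes_[i] for i in distances.argmin(axis=1)]

def toy_load(name):
    # Hypothetical loader: every "recording" holds two constant words.
    return [np.full(4, 0.0), np.full(4, 1.0)], ['LOW', 'HIGH']

def toy_extract(signal):
    # Hypothetical extractor: a single feature, the mean of the signal.
    return np.array([[signal.mean()]])

accuracy, preds = run_experiment(['a', 'b'], ['c'], toy_load, toy_extract, NearestMean())
print(accuracy, list(preds))
```

In the notebook the stand-ins would be replaced by `Signal_load`, `Extract_features` and `SVC(kernel="linear", C=0.025)`, and each example would reduce to a call with different file lists.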