# Load each of the speaker files
Librosa loads the audio file that was prepared in advance for each individual speaker. The load function returns two values: the audio time series and the sample rate. Passing sr=8000 as an argument resamples every file to 8 kHz. We now have a one-dimensional array of samples for each individual speaker, labeled by letter.

In [93]:
import librosa
import numpy as np
import pandas as pd

a, sr = librosa.load('/Users/James/Documents/SBU/ISE390/SpeechProject/Audio/ConcatenatedAudio/speaker1.wav', sr=8000)
b, sr = librosa.load('/Users/James/Documents/SBU/ISE390/SpeechProject/Audio/ConcatenatedAudio/speaker2.wav', sr=8000)
c, sr = librosa.load('/Users/James/Documents/SBU/ISE390/SpeechProject/Audio/ConcatenatedAudio/speaker3.wav', sr=8000)
d, sr = librosa.load('/Users/James/Documents/SBU/ISE390/SpeechProject/Audio/ConcatenatedAudio/speaker4.wav', sr=8000)
e, sr = librosa.load('/Users/James/Documents/SBU/ISE390/SpeechProject/Audio/ConcatenatedAudio/speaker5.wav', sr=8000)
f, sr = librosa.load('/Users/James/Documents/SBU/ISE390/SpeechProject/Audio/ConcatenatedAudio/speaker6.wav', sr=8000)
g, sr = librosa.load('/Users/James/Documents/SBU/ISE390/SpeechProject/Audio/ConcatenatedAudio/speaker7.wav', sr=8000)
h, sr = librosa.load('/Users/James/Documents/SBU/ISE390/SpeechProject/Audio/ConcatenatedAudio/speaker8.wav', sr=8000)
i, sr = librosa.load('/Users/James/Documents/SBU/ISE390/SpeechProject/Audio/ConcatenatedAudio/speaker9.wav', sr=8000)
j, sr = librosa.load('/Users/James/Documents/SBU/ISE390/SpeechProject/Audio/ConcatenatedAudio/speaker10.wav', sr=8000)

# Take the log of the absolute value of the Fourier transform, segmented into 20 ms windows with 10 ms overlap
Librosa's stft function computes the short-time Fourier transform of the audio series that is provided as an argument. Since the sample rate for these series is 8 kHz, 160 samples span 20 ms and 80 samples span 10 ms. Setting n_fft=160 makes each window 20 ms long, and setting hop_length=80 advances the window by 10 ms each time, so consecutive windows overlap by 10 ms. Taking the log of the magnitude gives a new matrix for each speaker, with one row per frequency bin and one column per window.

In [94]:
A = np.log(np.abs(librosa.stft(a, n_fft=160,hop_length=80)))
B = np.log(np.abs(librosa.stft(b, n_fft=160,hop_length=80)))
C = np.log(np.abs(librosa.stft(c, n_fft=160,hop_length=80)))
D = np.log(np.abs(librosa.stft(d, n_fft=160,hop_length=80)))
E = np.log(np.abs(librosa.stft(e, n_fft=160,hop_length=80)))
F = np.log(np.abs(librosa.stft(f, n_fft=160,hop_length=80)))
G = np.log(np.abs(librosa.stft(g, n_fft=160,hop_length=80)))
H = np.log(np.abs(librosa.stft(h, n_fft=160,hop_length=80)))
I = np.log(np.abs(librosa.stft(i, n_fft=160,hop_length=80)))
J = np.log(np.abs(librosa.stft(j, n_fft=160,hop_length=80)))
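The window arithmetic above can be checked with a few lines of plain Python. This is only a sketch: the constant names and the expected_frames helper are made up for illustration, and the frame-count formula assumes the STFT pads the signal so that frames are centered (which we believe is librosa's default behavior).

```python
SAMPLE_RATE = 8000      # Hz, the value passed to librosa.load
WINDOW_SECONDS = 0.020  # 20 ms analysis window
HOP_SECONDS = 0.010     # 10 ms step between consecutive windows

n_fft = int(round(SAMPLE_RATE * WINDOW_SECONDS))    # samples per window
hop_length = int(round(SAMPLE_RATE * HOP_SECONDS))  # samples per step
n_freq_bins = 1 + n_fft // 2                        # rows in each STFT matrix


def expected_frames(n_samples, hop):
    """Number of windows for a centered STFT (assumption, see the text above)."""
    return 1 + n_samples // hop


print(n_fft, hop_length, n_freq_bins)
print(expected_frames(SAMPLE_RATE, hop_length))  # windows in one second of audio
```

The 81 frequency bins computed here match the first dimension of the shapes printed in the next cell.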

# Get the shape of each speaker's matrix
This prints the shape of each speaker's matrix. We use the shapes to find the smallest number of windows so that we can later trim each of the matrices to the same size.

In [95]:
print(A.shape, B.shape, C.shape, D.shape, E.shape, F.shape, G.shape, H.shape, I.shape, J.shape)

(81, 21461) (81, 15085) (81, 14297) (81, 18885) (81, 16497) (81, 22005) (81, 23833) (81, 23081) (81, 19637) (81, 29009)


# Resize each matrix so that they are all the same size
This slices each of the matrices so that they are all the same length (number of windows). The length used is 14,297, the smallest window count printed above.

In [96]:
A = A[:, :14297]
B = B[:, :14297]
C = C[:, :14297]
D = D[:, :14297]
E = E[:, :14297]
F = F[:, :14297]
G = G[:, :14297]
H = H[:, :14297]
I = I[:, :14297]
J = J[:, :14297]
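The cut-off does not have to be typed in by hand. As a rough sketch, the smallest window count can be computed from the matrices themselves. The small random matrices below are hypothetical stand-ins for the real per-speaker spectrograms, and the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the log-spectrograms: 81 rows, varying window counts.
spectrograms = [rng.normal(size=(81, n_windows)) for n_windows in (50, 37, 42)]

# Find the shortest matrix and trim every matrix to that many windows.
min_frames = min(S.shape[1] for S in spectrograms)
trimmed = [S[:, :min_frames] for S in spectrograms]

print(min_frames, [S.shape for S in trimmed])
```

With the real data, the same two lines would pick 14,297 automatically and would keep working if a speaker file changed.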

# Create the target array of IDs for each speaker
Each window must be labeled with the speaker it came from. To do this, we create an array named target. Since there are 14,297 windows for each speaker, we assign that many consecutive labels to each speaker ID. The size of the array is the number of windows per speaker x the number of speakers, which is 14,297 x 10 = 142,970.

In [97]:
target = np.zeros(142970)
target[0 : 14297] = 0
target[14297 :28594] = 1
target[28594 :42891] = 2
target[42891 :57188] = 3
target[57188 :71485] = 4
target[71485 :85782] = 5
target[85782 :100079] = 6
target[100079 :114376] = 7
target[114376 :128673] = 8
target[128673 :142970] = 9
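The same label array can also be built without writing out each slice. This is a minimal sketch using np.repeat, assuming every speaker has the same number of windows; n_speakers and frames_per_speaker are names introduced here for illustration.

```python
import numpy as np

n_speakers = 10
frames_per_speaker = 14297

# Repeat each speaker ID once per window: 0,...,0,1,...,1,...,9,...,9
target = np.repeat(np.arange(n_speakers), frames_per_speaker)

print(target.shape, target[0], target[-1])
```

Because np.arange produces integers, the labels built this way are integer-typed from the start.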

# Concatenate the transpose of each of the matrices so that the shape matches the target value matrix shape
We then need to take the transpose of each of these matrices so that every row is one window and every column is one frequency bin, which matches the one-label-per-row layout of the target array. We then concatenate the transposed matrices into one larger matrix named "data". This is the matrix that is going to be fed into the neural network, along with the target array of labels. I also made sure to set the data type of the target array to integers, since the one-hot encoding step in the network code uses the labels as indices and would not work with floats. Then, to make sure that the two fit with each other, we print both shapes and check that the number of rows is the same.

In [98]:
data = np.concatenate([A.T ,B.T, C.T, D.T, E.T, F.T, G.T, H.T, I.T, J.T])
target = target.astype(int)
print("Data size: ", data.shape, "\nTarget shape: ", target.shape)

Data size:  (142970, 81) 
Target shape:  (142970,)


# Train the neural network and return the accuracy results
There are several steps involved in using the neural network to predict speakers. 

First, the data is prepared by the get_data() function. It prepends a bias column to the data, converts the labels into the one-hot vectors that are needed, and returns four arrays: a training set and a test set for both the data and the target values. 90% of the rows are assigned as training data, and 10% as test data.
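The preparation step can be sketched with NumPy alone. Note the assumptions: the function names and toy arrays below are ours, and the shuffled 90/10 split is a plain NumPy stand-in for scikit-learn's train_test_split, which is what the actual notebook code calls.

```python
import numpy as np


def add_bias_and_one_hot(data, target):
    """Prepend a column of 1s to the data and one-hot encode integer labels."""
    n_rows, n_features = data.shape
    X = np.ones((n_rows, n_features + 1))
    X[:, 1:] = data
    n_labels = len(np.unique(target))
    Y = np.eye(n_labels)[target]  # row k of the identity matrix encodes label k
    return X, Y


def split_90_10(X, Y, seed=42):
    """Shuffle the rows and hold out one tenth of them as a test set."""
    order = np.random.default_rng(seed).permutation(len(X))
    n_test = max(1, len(X) // 10)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], Y[train_idx], Y[test_idx]


# Toy stand-ins: 20 windows with 2 features each, 4 speakers with 5 windows each.
toy_data = np.arange(40, dtype=float).reshape(20, 2)
toy_target = np.repeat(np.arange(4), 5)

X, Y = add_bias_and_one_hot(toy_data, toy_target)
train_X, test_X, train_y, test_y = split_90_10(X, Y)
print(train_X.shape, test_X.shape, train_y.shape, test_y.shape)
```

The indexing trick np.eye(n_labels)[target] only works with integer labels, which is why the target array has to be an integer type.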

Once this is done, the sizes of the input layer, the hidden layer, and the output layer get set. 

Then, the weights are initialized to random values. 

After this, forward propagation is performed in order to get the network's output, and the error of that output is measured against the true labels. 

After getting the error, you use it to calculate the gradient of the cost with respect to the weights. This is done with backward propagation, and the gradient descent algorithm then adjusts the weights in proportion to the learning rate. 
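To make the forward and backward passes concrete, here is a rough NumPy sketch of the same kind of network: one sigmoid hidden layer and a softmax cross-entropy cost. It is not the TensorFlow code used in this notebook. The gradients are written out by hand, the toy data is random, and the update uses the whole toy batch at once with a learning rate of 0.1, whereas the notebook updates on one window at a time with a rate of 0.01.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def forward(X, w_1, w_2):
    """Return the hidden activations and the output logits."""
    h = sigmoid(X @ w_1)
    return h, h @ w_2


def cross_entropy(logits, Y):
    return -np.mean(np.sum(Y * np.log(softmax(logits) + 1e-12), axis=1))


def gradient_step(X, Y, w_1, w_2, learning_rate):
    """One gradient descent update of both weight matrices."""
    h, logits = forward(X, w_1, w_2)
    d_logits = (softmax(logits) - Y) / len(X)   # gradient of the cost w.r.t. logits
    grad_w_2 = h.T @ d_logits
    d_h = d_logits @ w_2.T
    grad_w_1 = X.T @ (d_h * h * (1.0 - h))      # chain rule through the sigmoid
    return w_1 - learning_rate * grad_w_1, w_2 - learning_rate * grad_w_2


rng = np.random.default_rng(42)
X = np.hstack([np.ones((30, 1)), rng.normal(size=(30, 4))])  # bias column + 4 features
Y = np.eye(3)[rng.integers(0, 3, size=30)]                   # random one-hot labels
w_1 = rng.normal(scale=0.1, size=(5, 20))
w_2 = rng.normal(scale=0.1, size=(20, 3))

loss_before = cross_entropy(forward(X, w_1, w_2)[1], Y)
for _ in range(200):
    w_1, w_2 = gradient_step(X, Y, w_1, w_2, learning_rate=0.1)
loss_after = cross_entropy(forward(X, w_1, w_2)[1], Y)
print(loss_before, loss_after)
```

The cost after the updates should be lower than before, which is all that gradient descent promises at each small step.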

After all of this is set up, the TensorFlow session gets initialized. Once this is done, the neural network can be trained and start making predictions on the speaker data. For every window in the training set, TensorFlow runs the gradient descent update on that single window and its label in order to reduce the cost function. The cost compares the network's prediction for the window with its label in the target array, and this gets run for every single window in the training data.

Once every training window has been processed, predictions are made for all of the training and test data, and the average accuracy of each is calculated and printed. These steps together make up one "epoch", and this gets run a designated number of times. In this case, it was run 100 times. 
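The accuracy calculation itself is small. As a minimal sketch (the accuracy function and the toy arrays are ours), the one-hot labels are turned back into class indices with argmax and compared with the predicted indices:

```python
import numpy as np


def accuracy(one_hot_labels, predicted_classes):
    """Fraction of rows whose predicted class matches the one-hot label."""
    true_classes = np.argmax(one_hot_labels, axis=1)
    return np.mean(true_classes == predicted_classes)


labels = np.eye(3)[[0, 1, 2, 1]]      # true speakers: 0, 1, 2, 1
predictions = np.array([0, 1, 1, 1])  # the third window is misclassified

print("accuracy = %.2f%%" % (100.0 * accuracy(labels, predictions)))
```

Three of the four toy windows are classified correctly, so this prints an accuracy of 75.00%.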

You can see the results of the predictions at the bottom of the page. There is also an image of them included in the project folder. The accuracy fluctuates from epoch to epoch but trends upward overall, and it appears that it would continue to improve if the neural network were given more time to train. 

In [99]:
# Implementation of a simple MLP network with one hidden layer.
# Originally tested on the iris data set; used here on the speaker data built above.
# Requires: numpy, sklearn>=0.18.1, tensorflow>=1.0

# NOTE: In order to make the code simple, we rewrite x * W_1 + b_1 = x' * W_1'
# where x' = [x | 1] and W_1' is the matrix W_1 appended with a new row with elements b_1's.
# Similarly, for h * W_2 + b_2

# author :vinhkhuc  Feb 26, 2017
import tensorflow as tf
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42
tf.set_random_seed(RANDOM_SEED)


def init_weights(shape):
    """ Weight initialization """
    weights = tf.random_normal(shape, stddev=0.1)
    return tf.Variable(weights)

def forwardprop(X, w_1, w_2):
    """
    Forward-propagation.
    IMPORTANT: yhat is not softmax since TensorFlow's softmax_cross_entropy_with_logits() does that internally.
    """
    h    = tf.nn.sigmoid(tf.matmul(X, w_1))  # The \sigma function
    yhat = tf.matmul(h, w_2)  # The \varphi function
    return yhat

def get_data():
    """ Add a bias column to the speaker data, one-hot encode the labels, and split into training and test sets """
    # Prepend the column of 1s for bias
    N, M  = data.shape
    all_X = np.ones((N, M + 1))
    all_X[:, 1:] = data

    # Convert into one-hot vectors
    num_labels = len(np.unique(target))
    all_Y = np.eye(num_labels)[target]  # One liner trick!
    return train_test_split(all_X, all_Y, test_size=0.1, random_state=RANDOM_SEED)

def main():
    train_X, test_X, train_y, test_y = get_data()

    # Layer's sizes
    x_size = train_X.shape[1]   # Number of input nodes: 81 frequency bins and 1 bias
    h_size = 20                 # Number of hidden nodes
    y_size = train_y.shape[1]   # Number of outcomes (10 speakers)

    # Symbols
    X = tf.placeholder("float", shape=[None, x_size])
    y = tf.placeholder("float", shape=[None, y_size])

    # Weight initializations
    w_1 = init_weights((x_size, h_size))
    w_2 = init_weights((h_size, y_size))

    # Forward propagation
    yhat    = forwardprop(X, w_1, w_2)
    predict = tf.argmax(yhat, axis=1)

    # Backward propagation
    cost    = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=yhat))
    updates = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

    # Run SGD
    #gpu_options = tf.GPUOptions(allow_growth=True)
    sess = tf.Session() # config=tf.ConfigProto(gpu_options=gpu_options))
    init = tf.global_variables_initializer()
    sess.run(init)

    for epoch in range(100):
        # Train with each example
        for i in range(len(train_X)):
            sess.run(updates, feed_dict={X: train_X[i: i + 1], y: train_y[i: i + 1]})

        train_accuracy = np.mean(np.argmax(train_y, axis=1) ==
                                 sess.run(predict, feed_dict={X: train_X, y: train_y}))
        test_accuracy  = np.mean(np.argmax(test_y, axis=1) ==
                                 sess.run(predict, feed_dict={X: test_X, y: test_y}))

        print("Epoch = %d, train accuracy = %.2f%%, test accuracy = %.2f%%"
              % (epoch + 1, 100. * train_accuracy, 100. * test_accuracy))

    sess.close()

if __name__ == '__main__':
    main()
    
# Print the dataset when running this to see how it is organized: many rows, each assigned a target value

Epoch = 1, train accuracy = 37.41%, test accuracy = 36.98%
Epoch = 2, train accuracy = 45.27%, test accuracy = 44.90%
Epoch = 3, train accuracy = 43.77%, test accuracy = 43.31%
Epoch = 4, train accuracy = 46.65%, test accuracy = 46.39%
Epoch = 5, train accuracy = 46.00%, test accuracy = 46.26%
Epoch = 6, train accuracy = 42.97%, test accuracy = 42.06%
Epoch = 7, train accuracy = 42.83%, test accuracy = 42.35%
Epoch = 8, train accuracy = 47.27%, test accuracy = 46.86%
Epoch = 9, train accuracy = 45.68%, test accuracy = 45.42%
Epoch = 10, train accuracy = 49.77%, test accuracy = 49.47%
Epoch = 11, train accuracy = 45.65%, test accuracy = 45.01%
Epoch = 12, train accuracy = 47.59%, test accuracy = 46.98%
Epoch = 13, train accuracy = 46.16%, test accuracy = 45.88%
Epoch = 14, train accuracy = 49.37%, test accuracy = 49.38%
Epoch = 15, train accuracy = 48.41%, test accuracy = 48.15%
Epoch = 16, train accuracy = 46.72%, test accuracy = 46.32%
Epoch = 17, train accuracy = 48.17%, test accurac