![msp](https://msdnshared.blob.core.windows.net/media/2016/11/Microsoft_Student_Partner2.jpg)


# Spoken Digit Challenge

This is the first challenge of our Speech and Machine Learning Workshop. Here we will use the Free Spoken Digit Dataset ([FSDD][]) to build different models that recognize spoken digits.

**Note:** Make sure that your dataset is in the correct folder (the code below expects it in `dataset/`). If something is not working for you, feel free to ask.

* 1500 recordings in total (150 per digit)
* 8kHz sampling rate
* 3 speakers
* English 
* File format: {digit\_label}\_{speaker\_name}\_{index}.wav <br> (e.g. "4\_jackson\_16.wav")
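
Since the label is encoded in the file name, it can be recovered without opening the audio. The following is a minimal sketch of that idea; the helper `parse_fsdd_filename` and the `Recording` tuple are hypothetical names introduced here for illustration and are not part of the dataset tooling.

```python
import os
from typing import NamedTuple


class Recording(NamedTuple):
    """The three parts encoded in an FSDD file name (hypothetical helper type)."""
    digit: int
    speaker: str
    index: int


def parse_fsdd_filename(path: str) -> Recording:
    """Split '{digit}_{speaker}_{index}.wav' into its three parts."""
    # Drop the folder and the '.wav' extension, then split on underscores
    stem, _extension = os.path.splitext(os.path.basename(path))
    digit, speaker, index = stem.split("_")
    return Recording(int(digit), speaker, int(index))


print(parse_fsdd_filename("dataset/4_jackson_16.wav"))
```

The notebook itself only needs the digit, which it reads from the first character of the file name.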

[FSDD]: https://github.com/Jakobovski/free-spoken-digit-dataset  

## Setup & Data Import

First, we import the modules we need and collect the paths of the audio files in the dataset.

In [None]:
# Import the relevant modules to be used later
import glob
import os
import librosa, librosa.display
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import specgram

In [None]:
# Dataset directory
DATASET_DIR = "dataset/"

# Create a list of the paths of all .wav files in the dataset directory
single_speaker = True
if single_speaker:
    sound_paths = [DATASET_DIR + f for f in os.listdir(DATASET_DIR) if f[-4:] == '.wav' and 'jackson' in f]
else:
    sound_paths = [DATASET_DIR + f for f in os.listdir(DATASET_DIR) if f[-4:] == '.wav']

### Data Exploration

Let's visualize different sound files. 

In [None]:
# Let's start with having a look at our data

def plot_wave(sound_filenames):
    plt.figure(figsize=(15, 2 * len(sound_filenames)))
    i = 1
    for filename in sound_filenames:
        data, sample_rate = librosa.load(DATASET_DIR + filename)
        digit_caption = "Digit " + os.path.basename(filename)[0]
        
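        # subplot() needs an integer number of rows, np.ceil returns a float
        plt.subplot(int(np.ceil(len(sound_filenames) / 2.0)), 2, i)
        # Note: librosa >= 0.10 removed waveplot; use librosa.display.waveshow there
        librosa.display.waveplot(np.array(data), sr=sample_rate)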
        i += 1
        plt.ylabel('Amplitude')
        plt.title(digit_caption)
    plt.subplots_adjust(top=0.8, bottom=0.08, left=0.10, right=0.95, hspace=0.5, wspace=0.35)
    plt.show()
    
def plot_spectrogram(sound_filenames, spec_type ='MEL'):
    i = 1
    plt.figure(figsize=(15, 2 * len(sound_filenames)))
    for filename in sound_filenames:
        data, sample_rate = librosa.load(DATASET_DIR + filename)
        digit_caption = "Digit " + os.path.basename(filename)[0]
        
        plt.subplot(int(np.ceil(len(sound_filenames) / 2.0)), 2, i)
              
        if spec_type == 'FFT':
            # Plot FFT spectrogram (magnitude of the complex STFT, in dB)
            fft = np.abs(librosa.stft(data, n_fft=256))
            librosa.display.specshow(librosa.amplitude_to_db(fft, ref=np.max), y_axis='log', x_axis='time')
        else:
            # Plot MEL spectrogram
            mel_spectrogram = librosa.feature.melspectrogram(y=data, sr=sample_rate, n_mels=128, fmax=8000)
            librosa.display.specshow(librosa.power_to_db(mel_spectrogram,ref=np.max), y_axis='mel', fmax=8000, x_axis='time')

        plt.title(digit_caption)
        i += 1
    plt.subplots_adjust(top=0.8, bottom=0.08, left=0.10, right=0.95, hspace=0.5, wspace=0.35)
    plt.show()

In [None]:
# Target sound filenames for visualization
sound_filenames = [str(i) + '_jackson_0.wav' for i in range(0, 10)]

In [None]:
# Visualize wave files
plot_wave(sound_filenames)

In [None]:
# Plot spectrogram
plot_spectrogram(sound_filenames)

## Feature Extraction

First, we will extract our features from the audio files. Two files will be generated - one for the features and one for the corresponding labels. Each line of the feature file, together with the same line of the label file, represents a single audio file.

### MFCC
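We will use a mel spectrogram here, which spaces its frequency bands according to the perceived pitch of a tone (the mel scale). Note that the code below works directly on the mel spectrogram values; MFCCs would additionally apply a discrete cosine transform to the log-mel spectrogram.
To read more click [here](https://archive.is/20130414065947/http://asadl.org/jasa/resource/1/jasman/v8/i3/p185_s1).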
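
To get a feeling for the mel scale, here is a small sketch that converts between Hz and mel. It assumes the widely used formula m = 2595 · log10(1 + f / 700); librosa may use a slightly different variant by default, so treat the numbers as illustrative. The function names are made up for this example.

```python
import math


def hz_to_mel(frequency_hz: float) -> float:
    """Convert a frequency in Hz to mel (2595 * log10(1 + f / 700) variant)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Equal steps in Hz become smaller and smaller steps in mel at high frequencies
for hz in (250, 500, 1000, 2000, 4000):
    print(f"{hz:>5} Hz -> {hz_to_mel(hz):7.1f} mel")
```

Doubling the frequency from 2000 Hz to 4000 Hz adds far fewer mels than going from 0 Hz to 2000 Hz, which mirrors how we perceive pitch differences.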

In [None]:
# Number of mel filters (frequency bands) we want to extract
# A higher number gives a higher frequency resolution of our signal - however, this
# means that we need to train more parameters
# 20 - 40 is used in most cases
n_mels = 40
# Number of spectrogram frames (time steps) kept per recording
# This depends on the audio length: for longer samples, the value should be increased
frame_size = 25

def extract_features(file_name):
    data, sample_rate = librosa.load(file_name)
    # pass the audio as keyword argument y (required by newer librosa versions)
    melgram = librosa.feature.melspectrogram(y=data, sr=sample_rate, n_mels=n_mels)
    #print('Original shape of our mel spectrogram data: ' + str(melgram.shape))
    
    # if our audio has fewer frames than frame_size, pad with zeros
    if melgram.shape[1] < frame_size:
        pad_width = frame_size - melgram.shape[1]
        melgram = np.pad(melgram, pad_width=((0, 0), (0, pad_width)), mode='constant')
        #print('Extended shape of our mel spectrogram data: ' + str(melgram.shape))
    # if it has more frames, cut it down
    elif melgram.shape[1] > frame_size:
        melgram = melgram[:,:frame_size]
        #print('Cut shape of our mel spectrogram data: ' + str(melgram.shape))
        
    # flatten the (n_mels, frame_size) matrix into one feature vector, row by row
    features = melgram.flatten()
    return features

def get_features_and_labels(sound_paths):
    features = None
    labels = np.empty(0)
    for p in sound_paths:
        ext_features = extract_features(p)

        if features is None:
            features = np.empty((0,len(ext_features)))
            
        features = np.vstack([features,ext_features])
        
        labels = np.append(labels, int(os.path.basename(p)[0]))
    # np.int was removed from NumPy; the built-in int works in all versions
    return np.array(features), np.array(labels, dtype=int)

In [None]:
features, labels = get_features_and_labels(sound_paths)
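
The padding and cutting step is what gives every recording the same number of features. The sketch below shows the same idea on dummy arrays instead of real audio; `fix_num_frames` is a hypothetical helper, and the shapes (40 mel bands, 25 frames) simply mirror the values chosen above.

```python
import numpy as np


def fix_num_frames(spectrogram: np.ndarray, num_frames: int) -> np.ndarray:
    """Zero-pad or cut a (bands, frames) array to exactly num_frames columns."""
    current = spectrogram.shape[1]
    if current < num_frames:
        # Append zero columns on the right (time axis) only
        spectrogram = np.pad(
            spectrogram, ((0, 0), (0, num_frames - current)), mode="constant"
        )
    # Slicing is a no-op when the array already has num_frames columns
    return spectrogram[:, :num_frames]


short_recording = np.ones((40, 18))  # dummy spectrogram with too few frames
long_recording = np.ones((40, 31))   # dummy spectrogram with too many frames

for spec in (short_recording, long_recording):
    fixed = fix_num_frames(spec, 25)
    print(spec.shape, "->", fixed.shape, "-> flattened:", fixed.flatten().shape)
```

Both inputs end up as vectors of length 40 * 25 = 1000, which is the feature dimension the classifier will see.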

### One-hot encoding:
![encoding](https://www.tensorflow.org/images/feature_columns/categorical_column_with_identity.jpg)

In [None]:
def one_hot_encode(labels):
    n_labels = len(labels)
    n_unique_labels = len(np.unique(labels))
    # one row per sample, one column per class; set the column of the true class to 1
    encoded = np.zeros((n_labels, n_unique_labels))
    encoded[np.arange(n_labels), labels] = 1
    return encoded

In [None]:
labels = one_hot_encode(labels)
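
As a quick illustration of what one-hot encoding produces, here is a minimal sketch on three made-up labels. It fixes the number of classes to 10 (an assumption that fits the digits 0 to 9) instead of counting the unique labels, so it also works when a digit is missing from the data.

```python
import numpy as np

NUM_CLASSES = 10  # digits 0-9

example_labels = np.array([4, 0, 9])  # made-up labels for illustration

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(NUM_CLASSES, dtype=int)[example_labels]
print(one_hot)

# argmax along each row recovers the original class index
decoded = np.argmax(one_hot, axis=1)
print(decoded)
```

The `argmax` step is the same trick the scoring cells use later to turn one-hot labels back into digits.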

### Save Processed Data

In [None]:
FEATURE_PATH = 'features/features.txt'
LABEL_PATH = 'features/labels.txt'

In [None]:
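# Make sure the output folder exists before writing
os.makedirs(os.path.dirname(FEATURE_PATH), exist_ok=True)
np.savetxt(FEATURE_PATH, features, fmt='%10.5f', delimiter='\t')
np.savetxt(LABEL_PATH, labels, fmt='%i', delimiter='\t')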

# Classification

Now, we will load our generated features and labels in order to train a classifier on them and evaluate its performance.

In [None]:
import random
np.random.seed(3006)
from keras.models import Sequential
from keras.constraints import maxnorm
from keras.initializers import lecun_uniform
from keras import optimizers
from keras.layers import Dense, Dropout, Activation, LSTM
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_recall_fscore_support)
import pandas as pd

## Loading data
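Before building the network, we will load the stored data and split it into three distinct subsets: train, test, and eval. The train set is used to fit the model, the eval set to monitor it during training, and the test set to score it afterwards.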
![Split Expl](https://image.slidesharecdn.com/dbm630-lecture08-120208003610-phpapp01/95/dbm630-lecture08-11-728.jpg?cb=1328661419)

In [None]:
features = np.loadtxt(FEATURE_PATH)
labels = np.loadtxt(LABEL_PATH)

print('Label shape: ' + str(labels.shape))
feature_dim = features.shape[1]
print('Feature dimensions: ' + str(feature_dim))

In [None]:
# Splits our whole dataset in three parts for training, testing, and evaluating our model
def split_train_test_eval (features, labels, train_percentage, test_percentage, eval_percentage):
    feature_label_pairs = list(zip(features, labels))
    random.seed(3006)
    random.shuffle(feature_label_pairs)
    
    features, labels = zip(*feature_label_pairs)
    features = np.array(features)
    labels = np.array(labels)
    
    sample_size = len(labels)
    print('Number of total samples: ' + str(sample_size))
    
    train_samples = int(sample_size * train_percentage)
    test_samples = int(sample_size * test_percentage)
    eval_samples = int(sample_size * eval_percentage)
    
    # just to make sure that we end up with the actual sample size:
    if train_samples + test_samples + eval_samples > sample_size:
        eval_samples = sample_size - train_samples - test_samples
    
    print('Train sample size: ' + str(train_samples))
    print('Test sample size: ' + str(test_samples))
    print('Eval sample size: ' + str(eval_samples))
    
    train_features = features[0 : train_samples]
    train_labels = labels[0 : train_samples]
    
    test_features = features[train_samples : train_samples + test_samples]
    test_labels = labels[train_samples : train_samples + test_samples]
    
    eval_features = features[train_samples + test_samples : train_samples + test_samples + eval_samples]
    eval_labels = labels[train_samples + test_samples : train_samples + test_samples + eval_samples]
    
    return train_features, train_labels, test_features, test_labels, eval_features, eval_labels


In [None]:
train_features, train_labels, test_features, test_labels, eval_features, eval_labels = split_train_test_eval (features, labels, 0.5, 0.3, 0.2)
testing = (test_features, test_labels)
evaluation = (eval_features, eval_labels)
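
An alternative way to think about the split is to shuffle indices rather than the data itself. The sketch below is not the notebook's method, only an illustration under that assumption; `split_indices` is a hypothetical helper, and it gives every leftover sample to the eval set so that no sample is dropped by rounding.

```python
import numpy as np


def split_indices(num_samples, train_fraction, test_fraction, seed=3006):
    """Return shuffled, non-overlapping index arrays for train, test and eval."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_samples)

    train_end = int(num_samples * train_fraction)
    test_end = train_end + int(num_samples * test_fraction)

    # Everything after the test block goes to eval, including rounding leftovers
    return order[:train_end], order[train_end:test_end], order[test_end:]


train_idx, test_idx, eval_idx = split_indices(500, 0.5, 0.3)
print(len(train_idx), len(test_idx), len(eval_idx))
```

The index arrays can then be applied to both arrays, for example `features[train_idx]` and `labels[train_idx]`, which keeps features and labels aligned.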

## Building our first model

Let's start with a DNN with 3 hidden layers:
![dnn](https://camo.githubusercontent.com/82b7fff72d1c4da37e0c4474bfd0cdd06b1a6a75/687474703a2f2f74656c656772612e70682f66696c652f3137356133343032346263343536353164306265362e706e67)


Defining the network architecture:

In [None]:
model = Sequential()
model.add(Dense(units=50, input_dim=feature_dim, activation="relu"))
model.add(Dense(units=20,activation="relu"))
model.add(Dense(units=15, activation="relu"))
# softmax turns the 10 outputs into probabilities that sum to 1 (one per digit)
model.add(Dense(units=10, activation="softmax"))

Training the model:
1. Define the optimizer
2. Compile the defined model
3. Train it

In [None]:
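# Note: newer Keras versions call the first argument `learning_rate` instead of `lr`
opt = optimizers.Adagrad(lr=0.01, epsilon=None, decay=0.0)
# The labels are one-hot vectors over 10 mutually exclusive digits,
# so we use categorical (not binary) cross-entropy
model.compile(loss="categorical_crossentropy",
              optimizer=opt,
              metrics=["accuracy"])

model.fit(train_features, train_labels, validation_data=evaluation, epochs=20, batch_size=1)

# Path of the file the trained model is saved to
MODEL_DIR = "models/model1.model"

# Make sure the target folder exists before saving
os.makedirs(os.path.dirname(MODEL_DIR), exist_ok=True)
model.save(MODEL_DIR)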

Score the model

In [None]:
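# predict() returns one row of class scores per sample;
# the predicted digit is the index of the highest score
# (predict_proba / predict_classes no longer exist in newer Keras versions)
prediction_probabilities = np.array(model.predict(test_features))
prediction = np.argmax(prediction_probabilities, axis=1)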
                                            
test_classes = np.argmax(test_labels, axis=1)
print(prediction)
print(test_classes)

accuracy = accuracy_score(test_classes, prediction)
print('Accuracy: ' + str(accuracy))

pd.crosstab(test_classes, prediction, rownames=['True'], colnames=['Predicted'], margins=True)
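
The crosstab contains more than the overall accuracy: each row also tells us how well a single digit is recognized. The sketch below illustrates this on made-up labels with only 3 classes; `confusion_counts` is a hypothetical helper written with plain NumPy, and none of the numbers are results of the model above.

```python
import numpy as np


def confusion_counts(true_classes, predicted_classes, num_classes=10):
    """Count how often true class r (row) was predicted as class c (column)."""
    counts = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly even when an index pair occurs several times
    np.add.at(counts, (true_classes, predicted_classes), 1)
    return counts


# Made-up toy data, not model output
true_classes = np.array([0, 0, 1, 1, 1, 2])
predicted_classes = np.array([0, 1, 1, 1, 2, 2])

counts = confusion_counts(true_classes, predicted_classes, num_classes=3)
accuracy = np.trace(counts) / counts.sum()               # correct / all
recall_per_class = np.diag(counts) / counts.sum(axis=1)  # correct / row total

print(counts)
print("Accuracy:", accuracy)
print("Recall per class:", recall_per_class)
```

A digit with a low value in `recall_per_class` is one the model often confuses with other digits, which is easy to miss when looking at accuracy alone.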

# Back to the presentation
Want to learn some more stuff?

### The next level
Let's build a simple LSTM-powered network.

![lstm](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png)

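Reshape the data into the 3D shape (samples, timesteps, features) that LSTM layers expect. Here each recording becomes a sequence with a single timestep: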

In [None]:
print(train_features.shape)
train_features = train_features.reshape((len(train_features), 1, feature_dim))
test_features = test_features.reshape((len(test_features), 1, feature_dim))
print(train_features.shape)
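
With a single timestep the LSTM never sees the recording as a sequence. One possible variation, sketched below on dummy data, is to turn each flat feature vector back into `frame_size` timesteps of `n_mels` values. This assumes the features were flattened row by row from a `(n_mels, frame_size)` spectrogram as in the feature extraction step; the first LSTM layer would then need `input_shape=(frame_size, n_mels)`.

```python
import numpy as np

n_mels, frame_size = 40, 25
num_samples = 3

# Dummy spectrograms standing in for real data: (samples, n_mels, frame_size)
spectrograms = np.arange(
    num_samples * n_mels * frame_size, dtype=float
).reshape(num_samples, n_mels, frame_size)

# Flat feature rows, analogous to the rows stored in the feature file
flat_features = spectrograms.reshape(num_samples, -1)

# Variant used in the notebook: one timestep holding all 1000 values
single_step = flat_features.reshape(num_samples, 1, n_mels * frame_size)

# Sequence variant: restore (n_mels, frame_size), then put time first
sequences = flat_features.reshape(num_samples, n_mels, frame_size).transpose(0, 2, 1)

print(single_step.shape)
print(sequences.shape)
```

The transpose matters: reshaping directly to `(num_samples, frame_size, n_mels)` would run without error but mix up mel bands and time frames.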

Define the network architecture

In [None]:
# Neural network with LSTM
model = Sequential()
model.add(LSTM(50,return_sequences=True, input_shape=(1, feature_dim)))
model.add(LSTM(20, return_sequences=True, input_shape=(1, 50)))
model.add(LSTM(15))
model.add(Dense(10))

Train the network

In [None]:
model.compile(loss='mean_squared_error', 
              optimizer='adam',
              metrics=["accuracy"])

model.fit(train_features, train_labels, epochs=20, batch_size=1, verbose=2)

Score model

In [None]:
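# the test features were already reshaped for the LSTM above;
# the predicted digit is the index of the highest output value
prediction = np.argmax(model.predict(test_features), axis=1)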
                                            
test_classes = np.argmax(test_labels, axis=1)
print(prediction)
print(test_classes)

accuracy = accuracy_score(test_classes, prediction)
print('Accuracy: ' + str(accuracy))

conf_mat = confusion_matrix(test_classes, prediction)
print(conf_mat)