## Preface
This notebooks aims to build a light-weight CNN.
It uses specgrams of resampled wav files(rate 8000) as inputs.
Due to Kaggle cloud hardware limitations, this script is a 'crippled' version of the original one.
In order to get LB 0.74, you need to set epoch to 5, set chop_audio(num=1000) and double all Conv layer parameters.
I haven't tuned the parameters for the CNN model here.

## File Structure
This script assumes data are stored in following strcuture:
speech
├── test            
│   └── audio #test wavfiles
├── train           
│   ├── audio #train wavfiles
└── model #store models
│
└── out #store sub.csv

## Possible Improvements
Since this is only a light-weight CNN, it's performance is limited.
Here are some ways to improve it's performance.
1. Use original wav files instead resampled ones.
2. Create more 'silence' wav files using chop_audio.
3. Build deeper CNN or use RNN.
4. Train for longer epochs

## After Words
It's still a long way to reach LB 0.88.
In fact, I doubt CNN would ever reach that high.
Feel free to share your ideas in the comment sections about using CNN to label wav files :)

## Appendix
Thanks __DavidS__ and __Alex Ozerin__ for their great notebooks!

In [14]:
import os
import numpy as np
from scipy.fftpack import fft
from scipy import signal
from glob import glob
import re
import pandas as pd
import gc
from scipy.io import wavfile

# from keras import optimizers, losses, activations, models
# from keras.layers import Convolution2D, Dense, Input, Flatten, Dropout, MaxPooling2D, BatchNormalization
from sklearn.model_selection import train_test_split
import torch
from torch import nn
from torch.autograd import Variable

The original sample rate is 16000, and we will resample it to 8000 to reduce data size.

In [3]:
L = 16000
legal_labels = 'yes no up down left right on off stop go silence unknown'.split()

#src folders
root_path = r'../data/'
out_path = r'.'
model_path = r'.'
train_data_path = os.path.join(root_path, 'train', 'audio')
test_data_path = os.path.join(root_path, 'test', 'audio')

Here are custom_fft and log_specgram functions written by __DavidS__.

In [4]:
def custom_fft(y, fs):
    T = 1.0 / fs
    N = y.shape[0]
    yf = fft(y)
    xf = np.linspace(0.0, 1.0/(2.0*T), N//2)
    # FFT is simmetrical, so we take just the first half
    # FFT is also complex, to we take just the real part (abs)
    vals = 2.0/N * np.abs(yf[0:N//2])
    return xf, vals

def log_specgram(audio, sample_rate, window_size=20,
                 step_size=10, eps=1e-10):
    nperseg = int(round(window_size * sample_rate / 1e3))
    noverlap = int(round(step_size * sample_rate / 1e3))
    freqs, times, spec = signal.spectrogram(audio,
                                    fs=sample_rate,
                                    window='hann',
                                    nperseg=nperseg,
                                    noverlap=noverlap,
                                    detrend=False)
    return freqs, times, np.log(spec.T.astype(np.float32) + eps)

Following is the utility function to grab all wav files inside train data folder.

In [5]:
def list_wavs_fname(dirpath, ext='wav'):
    print(dirpath)
    fpaths = glob(os.path.join(dirpath, r'*/*' + ext))
    pat = r'.+/(\w+)/\w+\.' + ext + '$'
    labels = []
    for fpath in fpaths:
        r = re.match(pat, fpath)
        if r:
            labels.append(r.group(1))
    pat = r'.+/(\w+\.' + ext + ')$'
    fnames = []
    for fpath in fpaths:
        r = re.match(pat, fpath)
        if r:
            fnames.append(r.group(1))
    return labels, fnames

__pad_audio__ will pad audios that are less than 16000(1 second) with 0s to make them all have the same length.

__chop_audio__ will chop audios that are larger than 16000(eg. wav files in background noises folder) to 16000 in length. In addition, it will create several chunks out of one large wav files given the parameter 'num'.

__label_transform__ transform labels into dummies values. It's used in combination with softmax to predict the label.

In [6]:
def pad_audio(samples):
    if len(samples) >= L: return samples
    else: return np.pad(samples, pad_width=(L - len(samples), 0), mode='constant', constant_values=(0, 0))

def chop_audio(samples, L=16000, num=20):
    for i in range(num):
        beg = np.random.randint(0, len(samples) - L)
        yield samples[beg: beg + L]

def label_transform(labels):
    nlabels = []
    for label in labels:
        if label == '_background_noise_':
            nlabels.append('silence')
        elif label not in legal_labels:
            nlabels.append('unknown')
        else:
            nlabels.append(label)
    return pd.get_dummies(pd.Series(nlabels))

Next, we use functions declared above to generate x_train and y_train.
label_index is the index used by pandas to create dummy values, we need to save it for later use.

In [7]:
labels, fnames = list_wavs_fname(train_data_path)

new_sample_rate = 8000
y_train = []
x_train = []

for label, fname in zip(labels, fnames):
    path = os.path.join(train_data_path, label, fname)
    sample_rate, samples = wavfile.read(path)
    samples = pad_audio(samples)
    if len(samples) > 16000:
        n_samples = chop_audio(samples)
    else: n_samples = [samples]
    for samples in n_samples:
        resampled = signal.resample(samples, int(new_sample_rate / sample_rate * samples.shape[0]))
        _, _, specgram = log_specgram(resampled, sample_rate=new_sample_rate)
        y_train.append(label)
        x_train.append(specgram)
    
x_train = np.array(x_train)
x_train = x_train.reshape(tuple(list(x_train.shape) + [1]))
y_train = label_transform(y_train)
label_index = y_train.columns.values
y_train = y_train.values
y_train = np.array(y_train)
del labels, fnames
gc.collect()

../data/train/audio




2029

CNN declared below.
The specgram created will be of shape (99, 81), but in order to fit into Conv2D layer, we need to reshape it.

In [17]:
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.features = nn.Sequential(
            nn.BatchNorm2d(1),
            nn.Conv2d(1, 8, 2),
            nn.ReLU(),
            nn.Conv2d(8, 8, 2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Dropout(p=0.2),
            nn.Conv2d(8, 16, 3),
            nn.ReLU(),
            nn.Conv2d(16, 16, 3),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Dropout(p=0.2),
            nn.Conv2d(16, 32, 3),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Dropout(p=0.2)
        )

        self.classifier = nn.Sequential(
            nn.Linear(2240, 128),
            nn.ReLU(),
            nn.BatchNorm1d(128),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.BatchNorm1d(128),
            nn.Linear(128, 12),
            nn.Softmax()
        )
        
    def forward(self, x):
        out = self.features(x)
        out = out.view(out.size(0), -1)
        out = self.classifier(out)
#         x = f.view(-1, 2240)
        return out

In [10]:
X_train, X_test, y_train, y_test = train_test_split(x_train, y_train, test_size=0.3, random_state=42)

In [23]:
cnn = CNN()

num_epochs = 6

# Loss and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(cnn.parameters(), lr=1e-3)
batch_size = 32

# Train the Model
for epoch in range(num_epochs):
    for i in range(X_train.shape[0] // batch_size):
        x = X_train[i * batch_size: (i + 1) * batch_size]
        y = y_train[i * batch_size: (i + 1) * batch_size]
        
        images = torch.from_numpy(x).permute(0, 3, 1, 2)
        labels = torch.from_numpy(np.argmax(y, axis=1))
        
        images = Variable(images)
        labels = Variable(labels)
        
        # Forward + Backward + Optimize
        optimizer.zero_grad()
        outputs = cnn(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        if (i+1) % 100 == 0:
            print ('Epoch [%d/%d], Iter [%d/%d] Loss: %.4f' 
                   %(epoch+1, num_epochs, i+1, X_train.shape[0] //batch_size, loss.data[0]))

  input = module(input)


Epoch [1/6], Iter [100/1418] Loss: 2.0868
Epoch [1/6], Iter [200/1418] Loss: 2.0543
Epoch [1/6], Iter [300/1418] Loss: 1.9563
Epoch [1/6], Iter [400/1418] Loss: 1.9324
Epoch [1/6], Iter [500/1418] Loss: 2.0255
Epoch [1/6], Iter [600/1418] Loss: 1.9317
Epoch [1/6], Iter [700/1418] Loss: 2.0565
Epoch [1/6], Iter [800/1418] Loss: 1.9627
Epoch [1/6], Iter [900/1418] Loss: 1.9939
Epoch [1/6], Iter [1000/1418] Loss: 2.1188
Epoch [1/6], Iter [1100/1418] Loss: 2.1188
Epoch [1/6], Iter [1200/1418] Loss: 2.0251
Epoch [1/6], Iter [1300/1418] Loss: 1.8688
Epoch [1/6], Iter [1400/1418] Loss: 1.9625
Epoch [2/6], Iter [100/1418] Loss: 1.9000
Epoch [2/6], Iter [200/1418] Loss: 2.0250
Epoch [2/6], Iter [300/1418] Loss: 1.9000
Epoch [2/6], Iter [400/1418] Loss: 1.9313
Epoch [2/6], Iter [500/1418] Loss: 2.0250
Epoch [2/6], Iter [600/1418] Loss: 1.9313
Epoch [2/6], Iter [700/1418] Loss: 2.0563
Epoch [2/6], Iter [800/1418] Loss: 1.9625
Epoch [2/6], Iter [900/1418] Loss: 1.9937
Epoch [2/6], Iter [1000/1418]

In [24]:
# Test the Model
cnn.eval()  # Change model to 'eval' mode (BN uses moving mean/var).
correct = 0
total = 0
for i in range(X_test.shape[0] // batch_size):
    x = X_test[i * batch_size: (i + 1) * batch_size]
    y = y_test[i * batch_size: (i + 1) * batch_size]    
    
    images = torch.from_numpy(x).permute(0, 3, 1, 2)
    labels = torch.from_numpy(np.argmax(y, axis=1))
    
    images = Variable(images)
    outputs = cnn(images)
    _, predicted = torch.max(outputs.data, 1)
    total += labels.size(0)
    correct += (predicted == labels).sum()

print('Test Accuracy of the model on the 10000 test images: %d %%' % (100 * correct / total))

# Save the Trained Model
torch.save(cnn.state_dict(), 'cnn.pkl')

  input = module(input)


Test Accuracy of the model on the 10000 test images: 63 %
