## **Problem 1: Speech Denoising Using 1D CNN**




**Implement a 1D CNN that performs speech denoising in the STFT magnitude domain. Here, a 1D CNN is a CNN variant that applies the convolution along only one axis; in our case, the frequency axis.**

**Implementation Approach:**


I have constructed a 1D CNN with the following structure:


*   **Network Layers:** Input layer + 2 convolutional layers + 1 fully connected layer + 1 output layer


*   **Input:** The input is reshaped to [-1, 513, 1], i.e. (batch size, frequency bins, channels)

*   **Convolutional layers:** 2 convolutional layers with the following settings (as set in the Conv1D_CNN code):

    - Layer 1: 16 filters, filter size = 16, stride 1, followed by max pooling with pool size 2 and stride 2
    - Layer 2: 32 filters, filter size = 8, stride 1, followed by max pooling with pool size 2 and stride 2


*   **Fully Connected Layer:** 2048 nodes
*   **Output Layer:** 513 nodes

*   **Batch Size:** 128

*   **Number of epochs:** 1200


*   **Activation Function:** ReLU activation function in all layers

*   **Initializer:** All weights are initialized using He initialization

*   **Learning Rate:** 0.0002

*   **Optimizer:** Adam optimizer

*   **Loss Function:** Mean squared error
*   **Dropout:** To avoid overfitting, a dropout rate of 0.4 (keep probability 0.6) is applied to the fully connected layer
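To make the layer sizes above concrete, here is a small sketch that traces the frequency-axis length through the two convolution and pooling stages. The helper names are my own (hypothetical), and the default filter counts (16, 32) are the ones passed to tf.layers.conv1d in the Conv1D_CNN code; the sketch assumes 'same' padding for the convolutions and 'valid' padding for the pooling, as in that code.

```python
def conv1d_same_length(length, stride=1):
    # With 'same' padding the output length is ceil(length / stride).
    return -(-length // stride)


def pool1d_valid_length(length, pool_size=2, stride=2):
    # With 'valid' padding only complete pooling windows are kept.
    return (length - pool_size) // stride + 1


def flattened_size(n_bins=513, filters=(16, 32)):
    # Trace the frequency-axis length through each conv + max-pool stage.
    length = n_bins
    for _ in filters:
        length = conv1d_same_length(length)
        length = pool1d_valid_length(length)
    # The flattened vector has (remaining length) x (filters in the last layer) entries.
    return length, length * filters[-1]


length, flat = flattened_size()
print(length, flat)
```

Under these assumptions the 513 bins shrink to 256 and then 128 positions, so the fully connected layer receives 128 values per filter of the second convolution.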

***Results***

*   **Loss in training data:**  0.02 to 0.03

*   **Calculated SNR for training data:** 18 to 19 dB

Import the libraries

In [0]:
# import the needed libraries
import tensorflow as tf
import numpy as np
import os
import matplotlib.pyplot as plt
import librosa


In [0]:
# in colab, you'll need to install this

#!pip install librosa 
import librosa

Import train input and output data

In [397]:

s, sr=librosa.load("train_clean_male.wav", sr=None)
S=librosa.stft(s, n_fft=1024, hop_length=512)


sn, sr=librosa.load("train_dirty_male.wav", sr=None)
X=librosa.stft(sn, n_fft=1024, hop_length=512)

X.shape

(513, 2459)
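The 513 rows come from the FFT size: a 1024-point FFT of a real signal yields 1 + 1024/2 = 513 frequency bins. A minimal sketch of the shape arithmetic follows, assuming librosa's default centered framing (center=True); the function name and the 16000-sample input are made up for illustration and are not taken from the training files.

```python
def stft_shape(n_samples, n_fft=1024, hop_length=512):
    # A real-input FFT of size n_fft has 1 + n_fft // 2 non-redundant bins.
    n_bins = 1 + n_fft // 2
    # Assumption: centered framing, which gives 1 + n_samples // hop_length frames.
    n_frames = 1 + n_samples // hop_length
    return n_bins, n_frames


# Hypothetical signal length, used only to exercise the function.
print(stft_shape(16000))
```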

In [398]:
# Transpose the training data to get the data samples in rows and features in columns and then take the absolute values of the STFT data
abs_S = np.abs(S.T)
abs_X = np.abs(X.T)

print(S.shape)
print(X.shape)

(513, 2459)
(513, 2459)


Import testing data

In [399]:

s1, sr=librosa.load('test_x_01.wav', sr=None)
S1 =librosa.stft(s1, n_fft=1024, hop_length=512)
s2, sr=librosa.load('test_x_02.wav', sr=None)
S2 =librosa.stft(s2, n_fft=1024, hop_length=512)


#import training data for testing as well
s3, sr=librosa.load('train_dirty_male.wav', sr=None)
S3 =librosa.stft(s3, n_fft=1024, hop_length=512)
print(S1.shape, S2.shape, S3.shape)

(513, 142) (513, 380) (513, 2459)


In [400]:
# import test data
test1, sr=librosa.load('test_x_01.wav', sr=None)
test1_stft =librosa.stft(test1, n_fft=1024, hop_length=512)

test2, sr=librosa.load('test_x_02.wav', sr=None)
test2_stft =librosa.stft(test2, n_fft=1024, hop_length=512)



#import training data for testing as well
training, sr=librosa.load('train_dirty_male.wav', sr=None)
training_stft =librosa.stft(training, n_fft=1024, hop_length=512)
print(test1_stft.shape, test2_stft.shape, training_stft.shape)

(513, 142) (513, 380) (513, 2459)


In [401]:
print(S1.shape, S2.shape, S3.shape)

(513, 142) (513, 380) (513, 2459)


Transpose the testing data and then take the absolute values of the STFT data

In [0]:

test1 = np.abs(test1_stft.T)
test2 = np.abs(test2_stft.T)
training = np.abs(training_stft.T)

The shape of the testing data

In [403]:
print(test1.shape)
print(test2.shape)

(142, 513)
(380, 513)


Create placeholders for the input and output variables

In [0]:
x = tf.placeholder('float', shape=(None,513))
y = tf.placeholder('float',shape=(None,513))

 

Keep probability for dropout

In [0]:
keep_prob = tf.placeholder("float") 
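For intuition about what the keep probability controls, here is a NumPy sketch of inverted dropout: each unit is kept with probability keep_prob and the kept units are scaled by 1 / keep_prob so the expected activation is unchanged. This is an illustration under my own naming (inverted_dropout is hypothetical), not the TensorFlow implementation itself.

```python
import numpy as np


def inverted_dropout(activations, keep_prob, rng):
    # Keep each unit independently with probability keep_prob.
    mask = rng.random(activations.shape) < keep_prob
    # Scale the kept units so the expected value matches the input.
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
h = np.ones((4, 8))

train_out = inverted_dropout(h, 0.6, rng)  # training: some units are zeroed
eval_out = inverted_dropout(h, 1.0, rng)   # evaluation: nothing is dropped
print(train_out)
```

Feeding a keep probability of 1.0 at prediction time, as done later for the test data, leaves the activations untouched.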


 Set the batch size

In [0]:
# batch Size
batch_size=128


The Conv1D_CNN function below constructs the two convolutional layers, one fully connected layer, and the output layer, using the ReLU activation function in all layers, including the output layer. I have also used He initialization to initialize the weights and biases. In addition, AdamOptimizer is used in the training function to minimize the loss.

In [0]:
def Conv1D_CNN(x, keep_prob):
  
    # reshape the input: -1 infers the batch size, 513 is the input width (frequency bins), 1 is the number of channels
    input_layer = tf.reshape(x,[-1,513,1])
    
    
    # use He to initialize all weights
    initializer = tf.contrib.layers.variance_scaling_initializer(factor=2.0 , mode='FAN_IN',uniform=False, dtype=tf.float32)


    
    
    
    
    # Convolution layer 1
    conv_layer_1 = tf.layers.conv1d(input_layer, 
                                    filters=16, kernel_size=16, strides=1,
                                    padding='same', activation = tf.nn.relu,kernel_initializer =initializer,
                                    bias_initializer =initializer) 
    
    #print('conv1', conv_layer_1.shape)
    # max pooling for layer 1
    
    max_pool_1 = tf.layers.max_pooling1d(conv_layer_1, pool_size=2, strides=2, padding='valid')
    
    
    
    
    
    
    # Convolution layer 2
    conv_layer_2 = tf.layers.conv1d(inputs=max_pool_1, 
                                    filters=32, kernel_size=8, strides=1,
                                    padding='same', activation = tf.nn.relu, kernel_initializer =initializer, 
                                    bias_initializer =initializer)
      
    max_pool_2 = tf.layers.max_pooling1d(inputs=conv_layer_2, pool_size=2, strides=2, padding='valid')
    
    

    
    conv2_flat = tf.contrib.layers.flatten(max_pool_2)
    
    
    
    
    # Fully connected layer
    
    fc1 = tf.layers.dense(conv2_flat, 2048, activation = tf.nn.relu,kernel_initializer =initializer , bias_initializer =initializer)
    
    fc1 = tf.nn.dropout(fc1, rate = 1 - keep_prob)


    
    

    
    # Output layer: predicts the 513 denoised magnitude values (ReLU keeps them non-negative)
    output = tf.layers.dense(fc1, 513,  activation = tf.nn.relu,kernel_initializer =initializer, bias_initializer =initializer )

    return output

**The function below trains the network: it computes the loss and minimizes it using the Adam optimizer.**

The mean_squared_error function is used to calculate the loss. The model reaches a loss of around 0.03 on the training data.

In [0]:
def train_nets(x):
  
    prediction = Conv1D_CNN(x, keep_prob)
    
    cost = tf.losses.mean_squared_error(y,prediction)
    
    train_step = tf.train.AdamOptimizer(learning_rate= 0.0002).minimize(cost)
    saver = tf.train.Saver()
    
    epochs = 1200
    
    with tf.Session() as sess:
      
        sess.run(tf.global_variables_initializer())
        
        for epoch in range(epochs):
            epoch_loss = 0
            start_index = 0
            
            for _ in range(int(abs_X.shape[0]/batch_size)):
                end_index = start_index +batch_size
                if end_index > abs_X.shape[0]:
                    end_index = abs_X.shape[0]
                batch_x = abs_X[start_index:end_index]
                batch_y = abs_S[start_index: end_index]
                # slice ends are exclusive, so the next batch starts exactly at end_index
                start_index = end_index
                _, err = sess.run([train_step, cost], feed_dict={x: batch_x, y: batch_y, keep_prob: 0.6}) 

                epoch_loss += err
            # report the summed batch loss every 50 epochs
            if epoch % 50 == 0:
                print('Epoch ',epoch, ' completed out of ',epochs, 'loss: ', epoch_loss)
        print('Epoch ',epoch, ' completed out of ',epochs, 'loss: ', epoch_loss)
        
       
        test1_pred = sess.run(prediction, feed_dict = {x: test1, keep_prob: 1.0})
        test2_pred = sess.run(prediction, feed_dict = {x: test2, keep_prob: 1.0})
        training_pred = sess.run(prediction, feed_dict = {x: training, keep_prob: 1.0})
        
        return test1_pred, test2_pred,training_pred
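The training loop above runs int(2459 / 128) = 19 full batches per epoch, so the last 27 frames are never used for training. As a sketch of one alternative, the generator below yields contiguous, non-overlapping slice bounds and includes the final partial batch. The name batch_slices is hypothetical, and this is not the loop that produced the results reported here.

```python
def batch_slices(n_samples, batch_size):
    # Yield (start, end) bounds; end is exclusive, matching Python slicing.
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)


slices = list(batch_slices(2459, 128))
print(len(slices), slices[0], slices[-1])
```

Each pair can be used directly as abs_X[start:end] and abs_S[start:end], which keeps the noisy and clean frames aligned.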

In [409]:

test1_pred, test2_pred, training_pred= train_nets(x)

Epoch  0  completed out of  1200 loss:  2.9229963812977076
Epoch  50  completed out of  1200 loss:  0.229036383330822
Epoch  100  completed out of  1200 loss:  0.15503776725381613
Epoch  150  completed out of  1200 loss:  0.12003292376175523
Epoch  200  completed out of  1200 loss:  0.10234591574408114
Epoch  250  completed out of  1200 loss:  0.08147533680312335
Epoch  300  completed out of  1200 loss:  0.07628886087331921
Epoch  350  completed out of  1200 loss:  0.06951843132264912
Epoch  400  completed out of  1200 loss:  0.065176184871234
Epoch  450  completed out of  1200 loss:  0.0698048184858635
Epoch  500  completed out of  1200 loss:  0.05055038526188582
Epoch  550  completed out of  1200 loss:  0.05727709620259702
Epoch  600  completed out of  1200 loss:  0.04638114990666509
Epoch  650  completed out of  1200 loss:  0.04630673653446138
Epoch  700  completed out of  1200 loss:  0.04518081352580339
Epoch  750  completed out of  1200 loss:  0.048253586050122976
Epoch  800  comp

In [0]:
# Reapply the phase of the noisy input (S / |S|) to the predicted magnitudes
out1 = test1_pred.T * (S1/test1.T)


out2 = (S2/test2.T)* test2_pred.T
out3 = (S3/training.T)* training_pred.T
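Dividing the complex STFT by its magnitude leaves a unit-magnitude phase term, which is then multiplied by the predicted magnitudes. If any STFT bin is exactly zero, this division produces NaN. A minimal NumPy sketch of the same step with a small floor on the denominator is shown below; the function name and the eps value are my own assumptions.

```python
import numpy as np


def apply_noisy_phase(pred_mag, noisy_stft, eps=1e-8):
    # Unit-magnitude phase of the noisy STFT; eps guards against division by zero.
    phase = noisy_stft / np.maximum(np.abs(noisy_stft), eps)
    # Combine the predicted magnitudes with the noisy phase.
    return pred_mag * phase


# Tiny made-up example, including one all-zero bin.
noisy = np.array([[3 + 4j, 0j], [1j, -2 + 0j]])
pred = np.array([[10.0, 1.0], [2.0, 4.0]])
out = apply_noisy_phase(pred, noisy)
print(out)
```

In the zero bin the phase is undefined, so this sketch returns 0 there instead of NaN; everywhere else the output magnitude equals the predicted magnitude.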

In [0]:
# apply the inverse STFT to recover the time-domain signals
test1_recons = librosa.istft(out1, win_length= 1024, hop_length=512)
test2_recons = librosa.istft(out2, win_length= 1024, hop_length=512)
test3_recons = librosa.istft(out3, win_length= 1024, hop_length=512)


In [0]:
# write it out

librosa.output.write_wav('test_s_01_recons.wav', test1_recons, sr)
librosa.output.write_wav('test_s_02_recons.wav', test2_recons, sr)
librosa.output.write_wav('training_s_03_recons.wav', test3_recons, sr)

Calculate **SNR** for training data

In [413]:
# Reapply the noisy phase, invert the STFT, and compare the result against the clean signal

out3 = training_pred.T * (X/np.abs(X))
test3_recons = librosa.istft(out3, win_length= 1024, hop_length=512)
s_clean = s[0:test3_recons.size]
SNR = 10*np.log10(np.dot(s_clean.T,s_clean)/np.dot((s_clean - test3_recons).T,(s_clean - test3_recons)))
SNR

18.893262147903442
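The SNR computed above is 10 log10 of the clean-signal energy divided by the energy of the difference between the clean and reconstructed signals. The same formula can be wrapped in a small reusable function; the sketch below uses a synthetic sine wave as a stand-in for speech, and both the function name snr_db and the test signal are illustrative assumptions.

```python
import numpy as np


def snr_db(clean, estimate):
    # Truncate both signals to a common length, as done above for s_clean.
    n = min(clean.size, estimate.size)
    clean, estimate = clean[:n], estimate[:n]
    noise = clean - estimate
    # Ratio of signal energy to residual energy, in decibels.
    return 10 * np.log10(np.dot(clean, clean) / np.dot(noise, noise))


# Synthetic check: an estimate that is 10% too quiet leaves a residual of
# 0.1 * clean, i.e. an energy ratio of 100.
t = np.linspace(0, 1, 8000, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
estimate = 0.9 * clean
snr = snr_db(clean, estimate)
print(snr)
```

For this synthetic case the result is approximately 20 dB, which gives a quick sanity check of the formula before applying it to real reconstructions.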




References:


*   https://www.tensorflow.org/tutorials/estimators/cnn
*   https://stackoverflow.com/questions/38114534/basic-1d-convolution-in-tensorflow
*  https://www.datacamp.com/community/tutorials/cnn-tensorflow-python




