## **Problem 2: Speech Denoising Using 2D CNN**




** Implement a 2D CNN that does the speech denoising in the STFT magnitude domain. 2D CNN here means a variant of CNN which does the convolution operation along two of the axis. It's both the
width (frequencies) and the height axes (frames) **

**Implementation Approach:**


I have constructed 2D CNN with the following structure:


*  ** Network Layers:** Input Layer + 2 Convolutional Layers + One Fully Connected Layer + One output  Layer


*  ** Input: **The input is shaped to [-1, 20, 513,1]

*  ** Convolutional layers: ** 2 Convolutional Layers with the following setting:
                                              
                    - Layer 1: No. of filters = 16, filter size = 4, strides 1 with maxpooling of (2X2) and stride of 2
                    - Layer 2: No. of filters = 32, filter size = 2, strides 1 with maxpooling of (2X2) and stride of 2


*   **Fully Connected Layer: **2048 nodes

*   **Output Layer: **513 nodes

*   **Batch Size: **128

*   **Number of epochs: **1300


*   **Activiation Function:**  Relu activation function in all layers 

*   **Initializer: **All weights are inialized using He initialization

*   **Learning Rate:** 0.0002

*   **Optimizer:** Adam Optimizer

*   **Loss Function:** Mean Squared Error
*   **drop out:** to avoid overfitting dropout rate of 0.4 is used 

***Results***

*   **Loss in training data:**   0.04

*   **Calculated the SNR for training data:** 16 to 17 dB

In [0]:
# import the needed libraries
import tensorflow as tf
import numpy as np
import os
import matplotlib.pyplot as plt
import librosa


In [0]:
# in colab, you'll need to install this

#!pip install librosa 
import librosa

Import the training data

In [27]:
# Import train input and output data
s, sr=librosa.load("train_clean_male.wav", sr=None)
S=librosa.stft(s, n_fft=1024, hop_length=512)


sn, sr=librosa.load("train_dirty_male.wav", sr=None)
X=librosa.stft(sn, n_fft=1024, hop_length=512)



# transpose the training data to get the data samples in rows and features in columns and then take the absolute values of the STFT data
abs_S = np.abs(S.T)
abs_X = np.abs(X.T)

print ('train clean', abs_S.shape)
print ('train dirty', abs_X.shape)




train clean (2459, 513)
train dirty (2459, 513)


Import the testing data

In [28]:
# import test data
s1, sr=librosa.load('test_x_01.wav', sr=None)
S1 =librosa.stft(s1, n_fft=1024, hop_length=512)
s2, sr=librosa.load('test_x_02.wav', sr=None)
S2 =librosa.stft(s2, n_fft=1024, hop_length=512)

#import training data for testing as well
s3, sr=librosa.load('train_dirty_male.wav', sr=None)
S3 =librosa.stft(s3, n_fft=1024, hop_length=512)
print(S1.shape, S2.shape, S3.shape)

training = np.abs(S3.T)


(513, 142) (513, 380) (513, 2459)


Generate the input as 20 frames shifted by one frame:

In [29]:
# Generate 2D image for training data
training_2D= np.array([np.reshape(abs_S[i:i+20], (20, 513)) for i in range(2440)])


print('reshaped training', training_2D.shape)

y_= abs_S[19:]

print(y_.shape)

reshaped training (2440, 20, 513)
(2440, 513)


In [30]:
# Generate 2D image for testing data


test1 = np.abs(S1.T)

test1_2D = np.array([np.reshape(test1[i:i+20], (20, 513)) for i in range (123)])

print('reshaped test1', test1_2D.shape)


test2 = np.abs(S2.T)
test2_2D = np.array([np.reshape(test2[i:i+20], (20, 513)) for i in range(361)])

print('reshaped test2', test2_2D.shape)





reshaped test1 (123, 20, 513)
reshaped test2 (361, 20, 513)


Create placeholder for the input and ouput variables

In [0]:
x = tf.placeholder('float', shape=(None,20,513))
y = tf.placeholder('float',shape=(None,513))

# probability for dropouts
keep_prob = tf.placeholder("float")  

 Set the batch size

In [0]:
# batch Size
batch_size=128


The below Conv2D_CNN function is used to construct the network with  2 conolutional layers, one fully comnnected layer and output layer using Relu activation function. I have also used He Initialization technique to initialize the weights and bias. In addition, AdamOptimizer is used to optimize the loss. 

In [0]:
def Conv2D_CNN(x, keep_prob):
  
    # reshape the input
    input_layer = tf.reshape(x, [-1, 20, 513,1])
    
    
    # use He to initialize all weights
    initializer = tf.contrib.layers.variance_scaling_initializer(factor=2.0 , mode='FAN_IN',uniform=False, dtype=tf.float32)
   

    

    # Convolution layer 1
    conv_layer_1 = tf.layers.conv2d(input_layer, 
                                    filters=16, kernel_size=4, strides=1,
                                    padding='same', activation = tf.nn.relu,kernel_initializer =initializer,
                                    bias_initializer =initializer) 
    
    max_pool_1 = tf.layers.max_pooling2d(conv_layer_1, pool_size=2, strides=2, padding='valid')
    
   
    
    
    # Convolution layer 2
    conv_layer_2 = tf.layers.conv2d(inputs=max_pool_1, 
                                    filters=32, kernel_size=2, strides=1,
                                    padding='same', activation = tf.nn.relu, kernel_initializer =initializer, 
                                    bias_initializer =initializer)
 

    max_pool_2 = tf.layers.max_pooling2d(conv_layer_2, pool_size=2, strides=2, padding='valid')
   
    
    
    #flatten the output of max pooling layer
    
    
    conv2_flat = tf.contrib.layers.flatten(max_pool_2)
    

    # Fully connected layer
    
    fc1 = tf.layers.dense(conv2_flat, 2048 , activation = tf.nn.relu,kernel_initializer =initializer , bias_initializer =initializer)
    
    fc1 = tf.nn.dropout(fc1,rate = 1 - keep_prob)


    # Output layer, class prediction
    output = tf.layers.dense(fc1, 513,  activation = tf.nn.relu,kernel_initializer =initializer, bias_initializer =initializer )

    return output

**The below function is used to train the connected network and calculate the loss and optimize it using Adam optimizer.**

mean_squared_error function is used to calculate the loss. However, the model has given loss of around % of 0.04 for the training data

In [0]:
def train_nets(x):
  
    prediction = Conv2D_CNN(x, keep_prob)
    
    cost = tf.losses.mean_squared_error(y,prediction)
    train_step = tf.train.AdamOptimizer(learning_rate= 0.0002).minimize(cost)
    saver = tf.train.Saver()
    
    epochs = 1300
    
    with tf.Session() as sess:
      
        sess.run(tf.global_variables_initializer())
        
        for epoch in range(epochs):
            epoch_loss = 0
            start_index = 0
            
            for _ in range(int(training_2D.shape[0]/batch_size)):
                end_index = start_index +batch_size
                if end_index > training_2D.shape[0]:
                    end_index = training_2D.shape[0]
                batch_x = training_2D[start_index:end_index]
                batch_y = y_[start_index: end_index]
                
                start_index = end_index + 1
                _, err = sess.run([train_step, cost], feed_dict={x: batch_x, y: batch_y, keep_prob: 0.6}) 

                epoch_loss += err
            for i in range(epoch % 50 == 0):
                print('Epoch ',epoch, ' completed out of ',epochs, 'loss: ', epoch_loss)
        print('Epoch ',epoch, ' completed out of ',epochs, 'loss: ', epoch_loss)
        
       
        test1_pred = sess.run(prediction, feed_dict = {x: test1_2D, keep_prob: 1.0})
        test2_pred = sess.run(prediction, feed_dict = {x: test2_2D, keep_prob: 1.0})
        training_pred = sess.run(prediction, feed_dict = {x: training_2D, keep_prob: 1.0})
        
        return test1_pred, test2_pred,training_pred
       

In [35]:

test1_pred, test2_pred, training_pred= train_nets(x)

Epoch  0  completed out of  1300 loss:  3.889369387179613
Epoch  50  completed out of  1300 loss:  0.4967529349960387
Epoch  100  completed out of  1300 loss:  0.36544008273631334
Epoch  150  completed out of  1300 loss:  0.234587034676224
Epoch  200  completed out of  1300 loss:  0.19846920482814312
Epoch  250  completed out of  1300 loss:  0.16428576014004648
Epoch  300  completed out of  1300 loss:  0.14594825624953955
Epoch  350  completed out of  1300 loss:  0.12630866165272892
Epoch  400  completed out of  1300 loss:  0.11213923676405102
Epoch  450  completed out of  1300 loss:  0.10236282797995955
Epoch  500  completed out of  1300 loss:  0.09017294715158641
Epoch  550  completed out of  1300 loss:  0.08401874208357185
Epoch  600  completed out of  1300 loss:  0.0778257978381589
Epoch  650  completed out of  1300 loss:  0.06892852147575468
Epoch  700  completed out of  1300 loss:  0.0698070481303148
Epoch  750  completed out of  1300 loss:  0.06529376906109974
Epoch  800  comple

Augment the output  with 19 silent frames

In [36]:
# Generate 19 silent frames


missing_frames = np.array(np.random.uniform(0,0.1, size = (19,513))/100000)

print(missing_frames.shape)

(19, 513)


In [0]:
# Add the silent frames to the predicted data
test1_pred= np.vstack((missing_frames, test1_pred))


test2_pred= np.vstack((missing_frames, test2_pred))


training_pred = np.vstack((missing_frames, training_pred))


In [0]:
# Recover the time domain speech signal
out1 = test1_pred.T * (S1/test1.T)


out2 = (S2/test2.T)* test2_pred.T
out3 = (S3/training.T)* training_pred.T

In [0]:
# Apply STFT
test1_recons = librosa.istft(out1, win_length= 1024, hop_length=512)
test2_recons = librosa.istft(out2, win_length= 1024, hop_length=512)
test3_recons = librosa.istft(out3, win_length= 1024, hop_length=512)


In [0]:
# write it out

librosa.output.write_wav('test_s_01_recons.wav', test1_recons, sr)
librosa.output.write_wav('test_s_02_recons.wav', test2_recons, sr)
librosa.output.write_wav('training_s_03_recons.wav', test3_recons, sr)

Calculate  the **SNR** for training data

In [41]:
# calculate SNR

out3 = training_pred.T * (X/np.abs(X))
test3_recons = librosa.istft(out3, win_length= 1024, hop_length=512)
s_clean = s[0:test3_recons.size]
SNR = 10*np.log10(np.dot(s_clean.T,s_clean)/np.dot((s_clean - test3_recons).T,(s_clean - test3_recons)))
SNR

16.705366373062134

References:


1.   https://www.datacamp.com/community/tutorials/cnn-tensorflow-python
1.   https://www.tensorflow.org/tutorials/estimators/cnn
2.   https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/02_Convolutional_Neural_Network.ipynb




