# ACOUSTIC ANALYSIS AND CLASSIFICATION 

### by Pavithra Govardhanan and Shree Harini Ravichandran, December 17, 2019


### INTRODUCTION 
 
Real-world datasets come in either structured or unstructured form. Structured data, such as a CSV file of continuous or discrete values, is easy to process and analyse. Unstructured data, such as an image or audio file, is harder to prepare for analysis.

Such unstructured data nevertheless contains useful information that remains largely unexplored. An image or audio file carries not only background information but also details about what it is trying to convey. For example, identification, recognition, authentication and emotion analysis can all be performed on unstructured datasets.

In everyday life we hear many kinds of sound: speech, noise, music and environmental sounds. Our brain continuously processes and interprets the audio it receives. Devices such as sensors and microphones capture these sounds and represent them in computer-readable formats such as WAV (Waveform Audio File Format), MP3 (MPEG-1 Audio Layer 3) and WMA (Windows Media Audio).

In our project we analyse audio samples obtained from a microphone, preprocess them so that they are fit for analysis, and classify musical instruments. The dataset contains audio files of musical instruments, with the instrument name as the class label. We apply digital signal processing concepts to preprocess the audio and extract the features required to train the model. A Convolutional Neural Network and a Recurrent Neural Network are used to build the models. When tested on a given audio sample, each model outputs the musical instrument class.

### METHODS

Natural language processing (NLP) is the ability of a computer program to understand human language and is a component of artificial intelligence (AI). It is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation. NLP can be used to interpret free text and make it analysable. Sentiment analysis is another primary use case for NLP.

This instrument identification project involves a few Digital Signal Processing (DSP) concepts. The mathematical DSP concepts that we used in preprocessing the audio samples are listed below:

<ul>
    <li>Fast Fourier Transform</li>
    <li>Short-Time Fourier Transform</li>
    <li>Discrete Cosine Transform</li> 
</ul>    

**Fast Fourier Transform**

Fast Fourier Transform is an important algorithm in the field of audio and acoustic measurements. It converts a signal from its original time or space domain into a frequency domain thereby giving frequency information about the signal. 

![fft.jpeg](attachment:fft.jpeg)

<center><b>Figure 1 FFT representation</b></center> 

The Fast Fourier Transform is an optimised algorithm for computing the Discrete Fourier Transform (DFT). Figure 1 is an FFT representation. If we take the 2-point DFT and 4-point DFT and generalize them to 8-point, 16-point, ..., $2^r$-point, we get the FFT algorithm. The Fourier transform that the FFT evaluates in discrete form is given as,

$$F(k) = \int_{-\infty}^{\infty} f(x) e^{-2\pi i k x} dx$$
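To make the step from small DFTs to a $2^r$-point transform concrete, the following is a minimal sketch of a recursive radix-2 FFT, checked against a direct DFT sum. It is an illustration only and not the code used in the project; the function names `dft` and `fft` and the toy 8-sample signal are our own assumptions.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) discrete Fourier transform, used here as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])   # DFT of the even-indexed samples
    odd = fft(x[1::2])    # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Combine the two half-size DFTs with a "twiddle" factor.
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

# Toy signal (assumption): a cosine completing 2 cycles over 8 samples.
signal = [math.cos(2 * math.pi * 2 * t / 8) for t in range(8)]
spectrum = fft(signal)
magnitudes = [abs(c) for c in spectrum]
print(magnitudes)
```

The energy of the cosine lands in bin 2 and its mirror image, bin 6, which is the frequency information the FFT is meant to expose.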


**Short-Time Fourier Transform**

The Short-Time Fourier Transform is one of the techniques for linear frequency analysis. It defines a particularly useful class of time-frequency distributions and specifies complex amplitude versus time and frequency for any signal. 


$$X_m(\omega) = \sum_{n = -\infty}^{\infty} x(n) w(n - mR)e^{-j\omega n}$$


where,

$x(n)$ = input signal at time $n$

$w(n)$ = length-$M$ window function (Hanning window)

$X_m(\omega)$ = DTFT of the windowed data centered about time $mR$

$R$ = hop size

<center><div style="width: 400px;">
<img src="attachment:llustration-of-short-time-Fourier-transform-on-the-test-signal-xt.png" width="400"></div> </center>


<center><b>Figure 2 Representation of Short-Time Fourier Transform</b></center> 


Figure 2 illustrates the Short-Time Fourier Transform. The STFT is usually visualized through its spectrogram, which is an intensity plot of STFT magnitude over time.
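The windowed transform above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the project's code: the helper names `hann` and `stft`, the window length `M = 64`, the hop size `R = 32` and the two-tone test signal are all hypothetical, and each frame starts at sample `mR` rather than being centered on it.

```python
import cmath
import math

def hann(M):
    """Length-M Hann (Hanning) window."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (M - 1)) for n in range(M)]

def stft(x, M, R):
    """Return one list of DFT bins (0..M//2) per frame of M samples,
    with consecutive frames R samples apart."""
    w = hann(M)
    frames = []
    for start in range(0, len(x) - M + 1, R):
        seg = [x[start + n] * w[n] for n in range(M)]
        bins = [sum(seg[n] * cmath.exp(-2j * math.pi * k * n / M)
                    for n in range(M))
                for k in range(M // 2 + 1)]
        frames.append(bins)
    return frames

# Hypothetical test signal: a tone at bin 4 for the first half,
# then a tone at bin 12 for the second half.
M, R = 64, 32
x = [math.sin(2 * math.pi * (4 if t < 256 else 12) * t / M) for t in range(512)]
frames = stft(x, M, R)

# The spectrogram is the magnitude of each frame; here we only report
# the strongest bin per frame to show how the content changes over time.
peaks = [max(range(len(f)), key=lambda k: abs(f[k])) for f in frames]
print(peaks)
```

The early frames peak at the lower bin and the late frames at the higher one, which is exactly the time-varying frequency content that a single FFT over the whole signal would hide.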

**Discrete Cosine Transform**

The discrete cosine transform (DCT) divides a signal or image into parts (spectral sub-bands) of differing importance, which helps remove unwanted overlaps. The DCT is closely related to the Discrete Fourier Transform: it transforms a signal or image from the time or spatial domain to the frequency domain.

$$F(u) = \left(\dfrac{2}{N}\right)^{1/2} \Lambda(u) \sum_{i = 0}^{N - 1} \cos \left[\dfrac{\pi u}{2N} (2i + 1) \right] f(i)$$

where $\Lambda(u) = 1/\sqrt{2}$ for $u = 0$ and $\Lambda(u) = 1$ otherwise.
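As a small worked example, the sketch below computes a one-dimensional DCT directly from the cosine sum. We assume the orthonormal DCT-II normalization, in which the scale factor $1/\sqrt{2}$ is applied to the zero-frequency coefficient; the function name `dct` and the sample values are hypothetical and chosen only for illustration.

```python
import math

def dct(f):
    """Orthonormal DCT-II of a real sequence f (assumed normalization)."""
    N = len(f)
    out = []
    for u in range(N):
        scale = 1 / math.sqrt(2) if u == 0 else 1.0
        total = sum(f[i] * math.cos(math.pi * u * (2 * i + 1) / (2 * N))
                    for i in range(N))
        out.append(math.sqrt(2 / N) * scale * total)
    return out

# Hypothetical input: a short symmetric sequence.
samples = [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0]
coeffs = dct(samples)

# With this scaling the transform preserves energy.
energy_in = sum(v * v for v in samples)
energy_out = sum(c * c for c in coeffs)
print(coeffs)
```

Most of the energy ends up in the first few coefficients, which is why keeping only the leading coefficients works as a form of compression.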


### Dataset

The dataset used for this project contained audio samples in .wav format, along with labels giving the instrument name. A WAV file stores raw, uncompressed audio, so these files take up a lot of space. Each audio sample was unique, and together the samples covered a mixture of instruments. The dataset comprised three hundred unique audio files covering ten different musical instruments, with about 30 recordings per instrument. The instruments included were saxophone, Violin_or_fiddle, hi_hat, snare drum, acoustic_guitar, double_bass, cello, bass_drum, flute and clarinet. We also maintained one .csv file with two columns: the first column, fname, held the names of the wav files, and the second column, label, held the name of the instrument.
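The label file described above can be read with Python's standard `csv` module. The following is a minimal sketch; the three rows and their file names are made up for illustration, and only the column names fname and label come from the description above.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt of the label file; the file names are invented.
sample_csv = io.StringIO(
    "fname,label\n"
    "clip_001.wav,saxophone\n"
    "clip_002.wav,flute\n"
    "clip_003.wav,saxophone\n"
)

rows = list(csv.DictReader(sample_csv))

# Count how many recordings each instrument class has.
counts = Counter(row["label"] for row in rows)
print(counts.most_common())
```

With the real file, the same count would be expected to show about 30 recordings for each of the ten classes.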

### Overview of the project

The steps on how this project was implemented are as follows:

- Data Collection 
- Data Pre-Processing
- Training the model
- Testing the model

**1. Data Collection:**

The audio samples (.wav file format) for the ten instruments were collected and cropped to approximately 10 to 15 seconds each. This was done to remove unwanted noise and low-frequency ranges from the audio. Since there were thirty audio samples for each instrument class, we computed the average clip length for each instrument to identify the class distribution, which is shown in Figure 3.

<center><div style="width:200px;">
<img src="attachment:label.jpeg" width="200"></div> </center>

<center><b>Figure 3 Average length of each instrument class</b></center> 


From the pie chart (Figure 4), we can see that the flute has the largest share of the class distribution. This implies that the average audio length was highest for flute and lowest for bass drum. 

![classdistr.png](attachment:classdistr.png)

<center><b>Figure 4 Instrument class distribution</b></center> 

The cropped audio samples were then converted into audio signal plots using the Librosa package. Librosa is a Python package for audio and music analysis that provides the building blocks necessary to create music information retrieval systems.

**2. Data Pre-Processing:**

The Fast Fourier Transform was applied to the amplitude signal to obtain a periodogram. Periodogram samples, one for every 10th of a second, were stacked to obtain the spectrogram. The Short-Time Fourier Transform was applied to this spectrogram to obtain the processed spectrogram. We then applied the Discrete Cosine Transform to obtain the Mel Frequency Cepstral Coefficients (MFCCs). The MFCC representation contains 13 features, which are the features needed to train the model.
    
**a. Plotting the signal:** 

The audio signal is plotted as amplitude over time. Since the dataset consists of 10 unique classes, this process is repeated for all 10 instrument classes, and the resulting graph is shown in Figure 5.

![audioimg.png](attachment:audioimg.png)


<center><b>Figure 5 Audio signal representation</b></center> 


**b. Periodogram:**

The periodogram shown in Figure 6 is obtained by applying the Fast Fourier Transform to the amplitude signal. In signal processing, a periodogram is an estimate of the spectral density of a signal. It is the most common tool for examining the amplitude vs frequency characteristics of FIR filters and window functions. The x axis denotes frequency, and the highest frequency it can represent is about 22 kHz, which corresponds to half the sampling rate. 

![periodogram.png](attachment:periodogram.png)


<center><b>Figure 6 Periodograms for all the instruments</b></center> 
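A periodogram can be sketched as the squared magnitude of the DFT, scaled by the number of samples. The example below is an illustration under stated assumptions rather than the project's code: the function name `periodogram`, the 44.1 kHz sampling rate, the 10 ms frame and the 1 kHz test tone are all hypothetical choices.

```python
import cmath
import math

def periodogram(x, sample_rate):
    """Return (frequencies in Hz, power) for bins 0..N//2 of signal x."""
    N = len(x)
    freqs, power = [], []
    for k in range(N // 2 + 1):
        X_k = sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        freqs.append(k * sample_rate / N)
        power.append(abs(X_k) ** 2 / N)
    return freqs, power

sample_rate = 44100   # assumed sampling rate in Hz
N = 441               # 10 ms of audio at that rate
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(N)]

freqs, power = periodogram(tone, sample_rate)
peak_hz = freqs[power.index(max(power))]
print(peak_hz, freqs[-1])
```

Only bins up to half the sampling rate are returned, which is why a recording sampled at roughly 44 kHz cannot show frequencies above roughly 22 kHz.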

**c. Spectrogram:**

Spectrograms represent the frequency content of audio as colors in an image, showing how the spectrum of frequencies of a signal varies with time. The frequency content of millisecond-long chunks is strung together as colored vertical bars. When applied to an audio signal, this visual representation is also called a sonograph, voiceprint or voicegram. When the data is plotted in 3D, it is called a waterfall plot. 

The spectrogram shown in Figure 7 is obtained from the periodogram by stacking up the audio signals for every 10 seconds. Each periodogram covers a small moment in time, and we use a standard window length of 25 ms to reduce noise. A Hanning window is applied to reduce spectral leakage. The Short-Time Fourier Transform is applied to get these spectrograms. This also removes blank spaces in the audio file. The graph shows time on the x axis and frequency on the y axis, with the lowest frequencies at the bottom and the highest frequencies at the top.



![spectrogram.png](attachment:spectrogram.png)

<center><b>Figure 7 Spectrogram representation</b></center> 


**d. MFCC Features:** 

The MFCC features shown in Figure 8 are obtained by applying the Discrete Cosine Transform. These are the key features with which the model was trained. A Mel filter bank converts linear frequencies to log-spaced (mel) frequencies. This is done to represent the audio signal more accurately. Initially 26 filters were present; after the DCT is applied, only 13 coefficients are kept. The DCT is used here for compression and to remove overlaps between the filters.

![melfilter.png](attachment:melfilter.png)

<center><b>Figure 8 Extracted Features</b></center> 
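The path from one frame's power spectrum to 13 MFCC features can be sketched as follows. This is a simplified illustration, not the project's implementation: the helper names, the common mel-scale formula with constants 2595 and 700, the 16 kHz sampling rate, the 257-bin synthetic spectrum and the unnormalized DCT are all assumptions on our part. Only the 26 filters and 13 kept coefficients come from the text.

```python
import math

def hz_to_mel(hz):
    return 2595 * math.log10(1 + hz / 700)

def mel_to_hz(mel):
    return 700 * (10 ** (mel / 2595) - 1)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    mels = [top * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mels]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):          # rising edge of the triangle
            filt[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):          # falling edge of the triangle
            filt[k] = (hi - k) / (hi - mid)
        bank.append(filt)
    return bank

def dct2(values, n_keep):
    """Unnormalized DCT-II, keeping only the first n_keep coefficients."""
    N = len(values)
    return [sum(v * math.cos(math.pi * u * (2 * i + 1) / (2 * N))
                for i, v in enumerate(values))
            for u in range(n_keep)]

def mfcc_from_power(power_spectrum, sample_rate, n_filters=26, n_keep=13):
    n_fft = 2 * (len(power_spectrum) - 1)
    bank = mel_filterbank(n_filters, n_fft, sample_rate)
    energies = [sum(f * p for f, p in zip(filt, power_spectrum)) for filt in bank]
    # Floor the energies so that the logarithm is always defined.
    log_energies = [math.log(max(e, 1e-10)) for e in energies]
    return dct2(log_energies, n_keep)

# Hypothetical frame: a smooth, decaying power spectrum with 257 bins.
frame_power = [1.0 / (1 + k) for k in range(257)]
features = mfcc_from_power(frame_power, sample_rate=16000)
print(len(features))
```

Repeating this for every frame of a clip gives a matrix with 13 rows and one column per frame, which is the kind of input the models are trained on.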


**3. Training the model:**

We implemented two neural network models: a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). The input to both was the MFCC features. The training accuracy of the CNN model was higher than that of the RNN model. By tuning the hyperparameters we improved the accuracy a little.

- **Hyper Parameter Tuning:**
Tuning hyperparameters for a deep neural network is difficult because the network is slow to train and there are numerous parameters to configure.

**a. Learning rate:**

Learning rate controls how much to update the weight in the optimization algorithm. We can use fixed learning rate, gradually decreasing learning rate, momentum-based methods or adaptive learning rates, depending on our choice of optimizer such as SGD, Adam, Adagrad, AdaDelta or RMSProp. 
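The effect of the learning rate can be seen on a toy problem. The sketch below minimizes $f(w) = w^2$ with plain gradient descent under a fixed and a gradually decreasing learning rate; the function name, the schedules and the starting point are hypothetical and unrelated to the values used in the project.

```python
def gradient_descent(lr_schedule, steps, w=5.0):
    """Minimize f(w) = w**2 using the learning rate given by lr_schedule(step)."""
    for step in range(steps):
        grad = 2 * w                      # derivative of w**2
        w -= lr_schedule(step) * grad     # weight update scaled by the learning rate
    return w

def fixed(step):
    return 0.1

def decaying(step):
    return 0.4 / (1 + step)

w_fixed = gradient_descent(fixed, steps=20)
w_decay = gradient_descent(decaying, steps=20)
print(w_fixed, w_decay)
```

Both schedules move the weight towards the minimum at zero, but how far each update moves depends entirely on the learning rate in use at that step.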

**b. Number of epochs:**

The number of epochs is the number of times the entire training set passes through the neural network. We should increase the number of epochs until we see a small gap between the test error and the training error. 

**c. Batch size:**

Mini-batch is usually preferable in the learning process of convnet. A range of 16 to 128 is a good choice to test with. We should note that convnet is sensitive to batch size. 

**d. Activation function:**

Activation function introduces non-linearity to the model. Usually, rectifier works well with convnet. Other alternatives of activation functions are sigmoid, tanh and other functions depending on the task. 
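For reference, the three activation functions mentioned here can be written in a few lines. This is a plain-Python sketch for single values; the function names are our own, and a deep learning library applies the same functions element-wise to whole tensors.

```python
import math

def relu(x):
    """Rectifier: passes positive values through and clips negatives to zero."""
    return max(0.0, x)

def sigmoid(x):
    """Squashes any real value into the range (0, 1)."""
    return 1 / (1 + math.exp(-x))

def tanh(x):
    """Squashes any real value into the range (-1, 1)."""
    return math.tanh(x)

for value in (-2.0, 0.0, 2.0):
    print(value, relu(value), sigmoid(value), tanh(value))
```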

**e. Number of hidden layers and units:**

It is usually good to add more layers until the test error no longer improves. The trade-off is that it is computationally expensive to train the network. Having a small number of units may lead to underfitting, while having more units is usually not harmful with appropriate regularization.

**f. Weight initialization:**

Initialize the weights with small random numbers to prevent dead neurons, but not too small to avoid zero gradient. Uniform distribution usually works well. 

**g. Dropout for regularization:**

Dropout is a preferable regularization technique to avoid overfitting in deep neural networks. The method simply drops out units in neural network according to the desired probability. A default value of 0.5 is a good choice to test with.  
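The mechanism can be sketched as follows. We assume the "inverted dropout" variant, in which the surviving units are scaled up during training so that nothing needs to change at test time; the function name `dropout` and the example values are hypothetical.

```python
import random

def dropout(activations, rate, training, rng):
    """Zero each unit with probability `rate` during training and rescale
    the survivors; return the activations unchanged otherwise."""
    if not training:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)          # fixed seed so the example is repeatable
units = [1.0] * 10

train_out = dropout(units, rate=0.5, training=True, rng=rng)
eval_out = dropout(units, rate=0.5, training=False, rng=rng)
print(train_out)
print(eval_out)
```

During training roughly half of the units are zeroed, while at evaluation time every unit is active, so the network sees the complete representation only when it is being validated or tested.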


- **Algorithm**

Neural networks are a set of algorithms designed to recognize patterns, loosely modelled on the human brain. They interpret sensory data such as text, audio and video through a kind of machine perception of the raw input. The patterns they recognize are numerical, so real-world data must first be translated into numbers.

Neural networks are the core of deep learning, a field that has practical applications in many different areas. Today neural networks are used for image classification, speech recognition, object detection and more.

A basic neural network looks like the one shown in Figure 9.

<center><div style="width:500px;">
<img src="attachment:neuralnetwork.jpeg" width="500"></div> </center>

<center><b>Figure 9 Neural Network</b></center>

**Convolutional Neural Network**

A Convolutional Neural Network (CNN) is a variation of the Multilayer Perceptron (MLP) that has at least one convolutional layer. A convolutional layer reduces complexity by applying convolution to its input and passing the result to the next layer. Because the complexity is reduced in this way, the model can handle complex data.

Keras provides an implementation of the convolutional layer called Conv2D. The Conv2D layer is the first layer that we used to extract audio features. The input shape (rows, columns and channels) must be provided as a parameter to the first Conv2D layer. The stride is the number of pixels by which the filter shifts over the input matrix. There are three types of padding: full, same and valid.

Among these three we used “same” padding, which pads the 2D input matrix with zeros around its border so that the output keeps the same height and width as the input. The Rectified Linear Unit (ReLU) activation function is used to introduce non-linearity into our ConvNet. Here, we add three Conv2D layers followed by a max pooling layer to reduce the spatial dimensions of the output volume.

Then a Dropout layer is added to prevent the model from overfitting. After this, a Flatten layer converts the pooled feature map to a single column that is passed to the fully connected layer. Finally, this is followed by a fully connected layer with the softmax activation function, and the model is compiled with the “adam” optimizer.

<center><div style="width:300px;">
<img src="attachment:cnn.jpeg" width="300"></div> </center>


<center><b>Figure 10 Audio analysis using convolutional neural network</b></center>
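To show what a convolution with zero padding, a ReLU and a max pooling step do to a feature map, here is a small library-free sketch. It is not the Keras model used in the project; the function names, the stride of 1, the 2x2 pooling window and the all-ones input are assumptions made for illustration.

```python
def conv2d_same(image, kernel):
    """Stride-1 2D convolution (cross-correlation) with zero padding so the
    output has the same height and width as the input, followed by ReLU."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    rr, cc = r + i - ph, c + j - pw
                    # Positions outside the image count as zeros (the padding).
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += image[rr][cc] * kernel[i][j]
            out[r][c] = max(0.0, acc)     # ReLU
    return out

def max_pool2x2(fmap):
    """Keep the largest value in each non-overlapping 2x2 block."""
    return [[max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]) - 1, 2)]
            for r in range(0, len(fmap) - 1, 2)]

image = [[1.0] * 4 for _ in range(4)]     # hypothetical 4x4 input
kernel = [[1.0] * 3 for _ in range(3)]    # hypothetical 3x3 filter

feature_map = conv2d_same(image, kernel)
pooled = max_pool2x2(feature_map)
print(len(feature_map), len(feature_map[0]))
print(pooled)
```

The convolution output stays 4x4 because of the padding, and pooling halves each spatial dimension, which is the reduction in output volume described above.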


**Recurrent Neural Network**

A Recurrent Neural Network (RNN) is designed to preserve the previous neuron state. This helps in producing the output of the current state based on the preserved previous state. An RNN understands sequential data by making the current hidden state dependent on the previous hidden state. The spectrogram is obtained from the periodogram by stacking 10 seconds of data, so it has a time component, which the RNN can use to identify the short-term and long-term features of the audio sample.


<center><div style="width:400px;">
<img src="attachment:rnn.jpeg" width="400"></div> </center>

<center><b>Figure 11 Recurrent Neural Network</b></center>
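The dependence of the current hidden state on the previous one can be illustrated with a single-unit recurrence, $h_t = \tanh(w_x x_t + w_h h_{t-1} + b)$. This is a deliberately simplified sketch and not the LSTM cell used in our model; the function name and the weights are hypothetical.

```python
import math

def rnn_forward(inputs, w_x, w_h, b):
    """Single-unit recurrent network: each state depends on the current
    input and on the previous hidden state."""
    h = 0.0
    states = []
    for x_t in inputs:
        h = math.tanh(w_x * x_t + w_h * h + b)
        states.append(h)
    return states

# Two hypothetical sequences that differ only in their first time step.
states_pulse = rnn_forward([1.0, 0.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)
states_silent = rnn_forward([0.0, 0.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)
print(states_pulse)
print(states_silent)
```

Although both sequences have the same input at the last two time steps, the hidden states differ because the first sequence still carries a trace of its earlier input.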


We used the Keras Sequential API because it lets us build the network one layer at a time. The layers are as follows:

- Two layers of LSTM cells with dropout to prevent overfitting. Since we used two stacked layers, return sequences is set to true.
- A Dropout layer to prevent overfitting to the training data.
- A TimeDistributed dense layer, which is used on RNNs, including LSTMs, to keep a one-to-one relation between input and output.
- A Flatten layer to convert the feature map to a single column that is passed to the fully connected layer.
- A Dense fully-connected output layer. 

This produces a probability for every instrument using softmax activation. The model is compiled with the Adam optimizer, trained using the categorical_crossentropy loss, and then saved. 
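The softmax output and the categorical cross-entropy loss can be written out by hand for a single sample. The sketch below uses three classes instead of ten to stay short; the function names mirror the concepts but are our own, and the scores are made up.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to one."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

logits = [2.0, 1.0, 0.0]                  # hypothetical scores for 3 classes
target = [1, 0, 0]                        # the true class is the first one

probs = softmax(logits)
loss = categorical_crossentropy(target, probs)
print(probs, loss)
```

The loss shrinks towards zero as the probability assigned to the true class approaches one, which is what training with this loss pushes the network to do.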


**4. Testing the model:**

After training is complete, the model is tested. Here we check whether the model is capable of predicting the correct result. In our project, the predicted instrument names are written to a csv file. The predictions of the two models are compared to determine which model is better. We found that the RNN model predicted more instrument names correctly than the CNN model.



### Team Contribution
On the whole, we completed this project dividing the work equally between us. The work contributed by each of us is listed below:
 
<b> Shree Harini Ravichandran: </b> 
 
Audio samples of instruments were collected from the internet, and 10 classes (10 types of instruments) were used for classification. For each instrument 30 audio samples were collected, giving 300 audio samples in total. After that, Pavithra generated the processed spectrogram, and I applied the Discrete Cosine Transform to generate the Mel Frequency Cepstral Coefficients. The MFCC representation contains 13 features, and these 13 features were analysed.

After pre-processing, the model was trained with a Recurrent Neural Network (RNN). I added two LSTM layers with 128 units each and return sequences set to True. After this, a Dropout layer was added, followed by four consecutive TimeDistributed dense layers with the ‘ReLU’ activation function. A Flatten layer was included after this step, and finally a dense layer with ten units, one for each of the 10 classes, with the softmax activation function. When fitting the model, hyperparameters such as the number of epochs and the learning rate were tuned to achieve better accuracy.

<b>Pavithra Govardhanan:</b>

After Shree Harini completed the data collection steps, I started with data pre-processing. The audio samples were converted into graphs using the librosa package. Then the Fast Fourier Transform (FFT) was applied to the audio signal to generate the periodogram. By stacking the samples for 10 seconds, the spectrogram for all ten instruments was obtained. The Short-Time Fourier Transform was applied to the spectrogram to generate the processed spectrogram. The rest of the pre-processing was carried out by Shree Harini.

After pre-processing, I trained the model with a Convolutional Neural Network (CNN). Four consecutive Conv2D layers were added, each with a (3,3) kernel and the ‘ReLU’ activation function, with varying numbers of filters. Then a max pooling layer followed by Dropout and Flatten layers was included, followed by two consecutive fully connected dense layers with the ‘ReLU’ activation function. The last dense layer had 10 units, one for each of the 10 classes, with the softmax activation function. 

After training the models, both of us did the prediction part. After preprocessing we saved the wav files in a folder called “clean”, since it held the processed audio files. The trained models were saved with the model extension and the pickle files with the .p extension. The predicted values were stored in “cnn_prediction.csv” and “rnn_prediction.csv”. Each prediction file has the wav filename and the predicted probability of each instrument. The instrument with the highest probability is written to a separate, final column, which is the predicted class. 



### RESULTS

The training accuracy of both models was promising for audio classification: 97.4% for the CNN and 91.8% for the RNN. This accuracy was improved by changing the hyperparameters. For example, the ‘tanh’ activation function, as discussed in class, did not perform well, whereas accuracy improved when the ReLU activation function was used. Although the CNN has a higher training accuracy than the RNN, the RNN classified more instruments correctly than the CNN during testing.

The instrument classification output from Convolutional Neural network:

![cnnoutput.jpeg](attachment:cnnoutput.jpeg)


The figure above shows part of the output generated by the CNN. Out of four saxophone audio samples, it classified three correctly. The model calculates a probability value for each instrument class, and the instrument with the highest value is taken as the prediction. 
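Choosing the class with the highest probability and writing it to a final column can be sketched as below. The file names, the probabilities and the three-class subset are invented for illustration, and the output is written to an in-memory buffer rather than to the actual prediction files.

```python
import csv
import io

classes = ["saxophone", "flute", "cello"]     # subset of the ten labels

# Hypothetical model outputs: file name -> probability per class.
probabilities = {
    "clip_01.wav": [0.70, 0.20, 0.10],
    "clip_02.wav": [0.15, 0.25, 0.60],
}

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["fname"] + classes + ["predicted"])
for fname, probs in probabilities.items():
    # Index of the largest probability gives the predicted class.
    best = max(range(len(classes)), key=lambda i: probs[i])
    writer.writerow([fname] + probs + [classes[best]])

rows = list(csv.reader(io.StringIO(buffer.getvalue())))
for row in rows:
    print(row)
```

Comparing the last column with the true label for each file gives the per-sample correctness discussed in this section.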

The instrument classification output from Recurrent Neural Network:

![rnnoutput.jpeg](attachment:rnnoutput.jpeg)


Similarly, for the RNN the instrument with the highest probability value is taken as the prediction. Unlike the CNN, however, the RNN correctly classified all four input audio files.

The training and validation accuracy and loss are plotted on graphs, shown below:

**Convolutional Neural Network**

Figure 12 depicts the accuracy and loss graph obtained after training the CNN model 


<center><div style="width:400px;">
<img src="attachment:cnngraph.jpeg" width="400"></div> </center>

<center><b>Figure 12 Training vs validation accuracy and loss for CNN</b></center>

**Recurrent Neural Network**

Figure 13 depicts the accuracy and loss graph obtained after training the RNN model 


<center><div style="width:400px;">
<img src="attachment:rnngraph.jpeg" width="400"></div> </center>


<center><b>Figure 13 Training vs validation accuracy and loss for RNN</b></center>


From Figure 12 and Figure 13, it is clear that both models trained well. The RNN's graph shows the accuracy increasing gradually from approximately 55%. For the CNN, in contrast, the accuracy does not increase steadily; it starts increasing from 85%. The loss graphs of both models also indicate that the models trained well enough to give good results. “Lesser the loss, the greater the accuracy”.

The RNN's training accuracy being lower than its validation accuracy might suggest that the model was overfitted. But in that case we would see the opposite trend, with training loss lower than validation loss. One possible reason is that all the entries in the dataset were used for training the model. If the data had been split, or a cross-validation technique had been used, the results might have been different.  

Another way of explaining this variation in accuracy is that it is a result of dropout. Since we disable neurons during training, information about some samples is lost and the successive layers try to build on incomplete representations. This makes it artificially harder to produce the right answers and results in a higher training loss. In contrast, during the validation step all the units are present and the network has the entire information. These were our guesses as to why the RNN performs better during the validation step.     


### CONCLUSION
In conclusion, we were able to preprocess the audio samples and extract useful features from them. Using Convolutional and Recurrent Neural Networks, we trained models to predict the musical instrument. We achieved good accuracy with both algorithms. Some of the challenges we faced in this project were finding a proper dataset, understanding digital signal processing concepts and understanding Recurrent Neural Networks.

Initially our project was to identify the speaker from an audio sample, but since we had a hard time preprocessing that data we shifted to a musical instrument dataset. Second, the audio files could not be used directly as input; they had to be preprocessed in several ways before we could use them. At first we thought that MFCC (Mel Frequency Cepstral Coefficients) alone was needed to obtain the features, and tried applying it directly to our audio files. We obtained matrix outputs that we could not interpret. After reading through many research papers, articles and YouTube videos, we figured out that a few more preprocessing steps have to be done before applying MFCC to the audio file.

Therefore, we followed the preprocessing steps described there. The audio samples were converted into periodograms and spectrograms by applying the Fast Fourier Transform and the Short-Time Fourier Transform respectively. A Mel filter bank was applied to these spectrograms to extract the useful features. After this feature extraction step we trained our models. Since Convolutional Neural Networks were covered in class, we were able to apply one to our dataset easily. Recurrent Neural Networks, on the other hand, were an entirely new concept, and we had to read a lot to understand them. We had planned to finish the project before Thanksgiving break, but unfortunately we could not, so we had to alter our plans to fit our schedule. We did about sixty percent of the project during the fall break and almost completed the rest within the following week.  

On the whole, this project was interesting and exciting to work on. Both of us had previously worked only with numerical data and, very rarely, image datasets. An audio dataset was something new and fun to play with. This project gave us a lot of insight into how audio files can be explored and used in different applications, such as similarity search for audio files, speaker identification, speaker authentication, indexing music files according to their features, and recommendation systems.

There are various ways in which this project could be extended in the future. Currently, the system is designed to classify an instrument when an audio sample of a single instrument is played. This could be improved by having it classify the right instrument when a number of instruments are playing in the background. For example, an audio sample from a musical concert could be given as input and the predominantly loud instrument identified. 

Another way this could be improved is by using it on a real-time dataset, for example to classify human voices. Initially, our project was to identify the speaker, but due to dataset issues we could not implement it. Speaker identification has a number of applications and can be used in various security domains for authentication and validation purposes.



### REFERENCES

[1][Steven W. Smith, 1997] Steven W. Smith,  The Scientist and Engineer's Guide to Digital Signal Processing, California Technical Publishing, 1997

[2]http://ismir2012.ismir.net/event/papers/559_ISMIR_2012.pdf

[3]http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/

[4]https://github.com/tyiannak/pyAudioAnalysis/wiki/3.-Feature-Extraction

[5]https://medium.com/datadriveninvestor/audio-and-image-features-used-for-cnn-4f307defcc2f

[6]https://medium.com/@mikesmales/sound-classification-using-deep-learning-8bc2aa1990b7

[7]https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html

[8]https://www.nti-audio.com/en/support/know-how/fast-fourier-transform-fft

[9]https://www.cs.tut.fi/sgn/arg/intro/basics.html

[10]http://www.cs.cmu.edu/afs/andrew/scs/cs/15-463/2001/pub/www/notes/fourier/fourier.pdf

[11]http://datagenetics.com/blog/november32012/index.html

[12]https://ccrma.stanford.edu/~jos/sasp/Mathematical_Definition_STFT.html

[13]https://www.sciencedirect.com/topics/engineering/short-time-fourier-transform


import glob
import io

import nbformat

nbfile = glob.glob('Project Report.ipynb')
if len(nbfile) > 1:
    print('More than one ipynb file. Using the first one.  nbfile=', nbfile)
with io.open(nbfile[0], 'r', encoding='utf-8') as f:
    # nbformat.current is deprecated, so read the notebook as format version 4
    nb = nbformat.read(f, as_version=4)
word_count = 0
for cell in nb.cells:
    # count space-separated words in markdown cells only, ignoring '#' markers
    if cell.cell_type == "markdown":
        word_count += len(cell['source'].replace('#', '').lstrip().split(' '))
print('Word count for file', nbfile[0], 'is', word_count)

Word count for file Project Report.ipynb is 4091
