
# Introduction to Audio Exploratory Data Analysis
## Part 1: Creating Metadata

<b> By Daniel Gladman, December 2022 </b>




This brief tutorial showcases some functions for exploring the data set that was scraped together in the previous tutorial.

I will briefly showcase a library called Mutagen, which can be used to extract the duration of an audio file and store it in a variable for later use.



In [1]:
# Import the libraries
from mutagen.wave import WAVE

import IPython.display as ipd
import mutagen
import os 
import pandas as pd

Extracting the duration is simple using Mutagen's WAVE class. 
Calling 'WAVE(...)' reads the audio file, '.info' returns a WaveStreamInfo object describing the audio stream, and '.length' holds the duration in seconds as a float. The example wraps it in 'int()', which truncates it to whole seconds.

In [2]:
example_file = './Data/chicken/chicken_23.wav'

audio = WAVE(example_file)
print(audio)
audio_info = audio.info
print(audio_info)
length = int(audio_info.length)
print(f'Duration = {length} second/s')

{}
<mutagen.wave.WaveStreamInfo object at 0x000001975D42DCD0>
Duration = 1 second/s
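
As an independent cross-check that does not rely on Mutagen, the duration of an uncompressed PCM WAV file can also be derived with the standard library's `wave` module, as the number of frames divided by the frame rate. The sketch below is only an illustration under stated assumptions: it writes a small synthetic tone so that it runs on its own, and the names `wav_duration_seconds` and `demo_tone.wav` are hypothetical, not part of the data set or the original notebook.

```python
import math
import os
import struct
import tempfile
import wave


def wav_duration_seconds(filename):
    """Return the duration of a PCM WAV file in seconds as a float."""
    with wave.open(filename, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


# Build a 1.5 second, 8 kHz, 16-bit mono sine tone so the example is self-contained.
sample_rate = 8000
n_frames = 12000
samples = (
    int(10000 * math.sin(2 * math.pi * 440 * i / sample_rate))
    for i in range(n_frames)
)
frames = b"".join(struct.pack("<h", s) for s in samples)

# Hypothetical file name, written to a temporary folder
demo_path = os.path.join(tempfile.mkdtemp(), "demo_tone.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(sample_rate)
    wav_file.writeframes(frames)

duration = wav_duration_seconds(demo_path)
print(f"Exact duration     = {duration} s")       # 1.5
print(f"Truncated duration = {int(duration)} s")  # 1
```

The half second lost in the second line is the same truncation that `int(audio_info.length)` applies above.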


To confirm, let's load the file with IPython's audio player.

In [3]:
ipd.Audio(example_file)

The purpose of extracting this information is that we may want to create some metadata about all the files we scraped together. 

This metadata could be used for purposes of data cleaning and exploration.

I will create a function that captures the length using Mutagen. I will also extract the label and the filename from the folders we created during scraping.

I will then store all this information into a dataframe and save it as a csv file. We can load the csv file later and use it to perform some other cleaning functions. But let's save that for the next part.

In [4]:
def CalcAudioDuration(length):
    """
    Return the duration in seconds. You could add more features like hours
    and minutes here, but for now it is just seconds.
    """
    seconds = length  # the length is already expressed in seconds

    return seconds  # returns the duration


def ComputeAudioDuration(filename):
    """
    Retrieve the duration, in whole seconds, of the WAV file passed to it.
    """
    audio = WAVE(filename)
    audio_info = audio.info
    length = int(audio_info.length)  # truncated: clips under 1 second become 0
    seconds = CalcAudioDuration(length)
    return seconds


def ExtractClassAndFile(path):
    """
    Walk the class folders under path and return two parallel lists:
    the class (folder name) and the filename of every file found.
    """
    classes = []
    filenames = []

    for folder in sorted(os.listdir(path)):
        folderpath = os.path.join(path, folder)
        # Skip plain files such as metadata.csv, which an earlier run leaves in path
        if not os.path.isdir(folderpath):
            continue
        for file in sorted(os.listdir(folderpath)):
            classes.append(folder)
            filenames.append(file)
    return classes, filenames


def GetAudioDurations(path, classes, filenames):
    """
    Return the duration in seconds of every (class, filename) pair.
    """
    seconds = []
    for cls, filename in zip(classes, filenames):
        fp = os.path.join(path, cls, filename)
        second = ComputeAudioDuration(fp)
        seconds.append(second)
    return seconds


def CreateMetaData(path):
    """
    Build a dataframe with one row per audio file: filename, seconds, class.
    """
    classes, filenames = ExtractClassAndFile(path)
    seconds = GetAudioDurations(path, classes, filenames)

    # Building from a dict keeps each column's own dtype (seconds stays numeric)
    df = pd.DataFrame({'filename': filenames, 'seconds': seconds, 'class': classes})
    return df


def WriteMetadata(path):
    """
    Create the metadata and save it as metadata.csv inside path.
    """
    df = CreateMetaData(path)
    df.to_csv(os.path.join(path, 'metadata.csv'))


path = "./Data/"
WriteMetadata(path)
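
The docstring of `CalcAudioDuration` mentions that hours and minutes could be added. A minimal sketch of that extension might look like the following. The helper name `FormatDuration` is hypothetical and not part of the original notebook.

```python
def FormatDuration(total_seconds):
    """Split a number of seconds into hours, minutes and seconds.

    Hypothetical helper for illustration; returns a string like '0:01:05'.
    """
    minutes, seconds = divmod(int(total_seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


print(FormatDuration(5))     # 0:00:05
print(FormatDuration(65))    # 0:01:05
print(FormatDuration(3725))  # 1:02:05
```

For clips of a few seconds, as in this data set, plain seconds are arguably the more convenient unit, which is presumably why the notebook keeps them.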

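Now that it's done, we have a dataframe that we can use later. The 'Unnamed' columns in the output below are saved index columns: 'to_csv' writes the dataframe index by default, and 'read_csv' reads it back as an unnamed column. Passing 'index=False' to 'to_csv' avoids this.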

In [5]:
pd.read_csv('./Data/metadata.csv')

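| Unnamed: 0.1 | Unnamed: 0 | filename | seconds | class |
|---|---|---|---|---|
| 0 | 0 | bird_1.wav | 1 | bird |
| 1 | 1 | bird_10.wav | 1 | bird |
| 2 | 2 | bird_100.wav | 3 | bird |
| 3 | 3 | bird_101.wav | 1 | bird |
| 4 | 4 | bird_102.wav | 3 | bird |
| ... | ... | ... | ... | ... |
| 870 | 870 | sheep_5.wav | 3 | sheep |
| 871 | 871 | sheep_6.wav | 0 | sheep |
| 872 | 872 | sheep_7.wav | 5 | sheep |
| 873 | 873 | sheep_8.wav | 3 | sheep |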
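
As a small preview of how this metadata might support cleaning and exploration, the sketch below counts files per class and flags clips whose truncated duration is 0 seconds (such as sheep_6.wav above). It is an illustration only: it uses the standard library and a handful of rows copied from the output above, embedded as a string so that it runs on its own. With the real data you would read the metadata file instead.

```python
import csv
import io
from collections import Counter

# A few rows copied from the output above (index columns omitted).
sample_csv = """filename,seconds,class
bird_1.wav,1,bird
bird_10.wav,1,bird
bird_100.wav,3,bird
sheep_5.wav,3,sheep
sheep_6.wav,0,sheep
sheep_7.wav,5,sheep
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Number of files per class.
class_counts = Counter(row["class"] for row in rows)

# Clips whose duration was truncated to 0 seconds are candidates for review.
too_short = [row["filename"] for row in rows if int(row["seconds"]) == 0]

print(dict(class_counts))  # {'bird': 3, 'sheep': 3}
print(too_short)           # ['sheep_6.wav']
```

With the dataframe loaded through pandas, `df['class'].value_counts()` and `df[df['seconds'] == 0]` would give the same two views.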


That's it for now. Next time, let's hunt for duplicated files.