## Objective of the Notebook
(NoobAtem) Make sure to run drive.py before interacting with my 
notebooks. I created this notebook to meet a specific set of goals. Feel free 
to adapt it to your own needs, but its purpose is to do the following:
* Overview of data sources
* Inspection of each data source
* Format the data and integrate
* Documentation
* Save the data   

### Prerequisite Libraries

In [1]:
import os
import shutil
import pandas as pd
import numpy as np
import re
import librosa
import yaml
import IPython.display as ipd

### Data Overview

Let's start by defining the relevant paths, for convenience throughout this notebook

In [2]:
# The main parent folder for all of the project's data
DATA_P: str = "../data"

RAW_P: str = os.path.join(DATA_P, "raw") # Data that we've formatted and designed/collected
EXTERN_P: str = os.path.join(DATA_P, "external") # 3rd party dataset
INTERIM_P: str = os.path.join(DATA_P, "interim") # Data that is still being processed
PROCESS_P: str = os.path.join(DATA_P, "processed") # Processed data ready to be used

# The main parent for handling configs
CONFIG_P: str = "../configs"

DATA_CONF_P: str = os.path.join(CONFIG_P, "dataset_config.yaml")

We can read the dataset_config.yaml file to see which data sources are currently available to work with

In [32]:
config: dict = None
with open(DATA_CONF_P, "r") as yml_file:
    config = yaml.safe_load(yml_file)

*External Data Source*

In [4]:
for val in config["external"].keys():
    if not val == "description":
        print(f"Found source: {val}")

Found source: esc50


*Internal Data Source*

In [5]:
for val in config["internal"].keys():
    if not val == "description":
        print(f"Found source: {val}")

Currently there is only one data source listed in the project config, and that is ESC-50
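The two loops above differ only in the section name, so they could be folded into one small helper. This is a minimal sketch, not the notebook's own code: `list_sources` and `example_config` are hypothetical, and the dictionary only imitates the shape that `yaml.safe_load` appears to return here (a `description` key next to one key per source).

```python
from typing import Any


def list_sources(config: dict[str, Any], section: str) -> list[str]:
    """Return the source names under `section`, skipping the 'description' key."""
    entries = config.get(section) or {}
    return [name for name in entries if name != "description"]


# Hypothetical stand-in for the parsed dataset_config.yaml (assumed shape).
example_config = {
    "external": {"description": "3rd party datasets", "esc50": {}},
    "internal": {"description": "Data collected for this project"},
}

for section in ("external", "internal"):
    print(section, list_sources(example_config, section))
```

Returning a list instead of printing makes it easy to check whether a section is empty before moving on.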

### Inspection of Source

Inspecting the ESC-50 project folder

In [33]:
# Define the path first, then view the entries relevant to the project structure

ESC50_P: str = os.path.join(EXTERN_P, "esc50")
list(filter(lambda v: not re.search(r"\.", v), os.listdir(ESC50_P)))

['tests', 'audio', 'LICENSE', 'meta']
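Filtering on "no dot in the name" also keeps extension-less files such as `LICENSE`, as the output above shows. If only folders are wanted, one option is to ask the filesystem directly. The sketch below uses a hypothetical helper, `split_entries`, and a throwaway directory whose file names are only stand-ins for the real ESC-50 checkout.

```python
import os
import tempfile


def split_entries(path: str) -> tuple[list[str], list[str]]:
    """Return (directories, files) found directly under `path`, each sorted by name."""
    dirs, files = [], []
    with os.scandir(path) as entries:
        for entry in entries:
            (dirs if entry.is_dir() else files).append(entry.name)
    return sorted(dirs), sorted(files)


# Temporary layout that imitates the listing above (names are illustrative).
with tempfile.TemporaryDirectory() as root:
    for name in ("tests", "audio", "meta"):
        os.mkdir(os.path.join(root, name))
    for name in ("LICENSE", "README.md"):
        open(os.path.join(root, name), "w").close()
    print(split_entries(root))
```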

Let's display the contents of the tests and meta folders

In [34]:
TEST_ESC50_P: str = os.path.join(ESC50_P, "tests")
META_ESC50_P: str = os.path.join(ESC50_P, "meta")
AUDIO_ESC50_P: str = os.path.join(ESC50_P, "audio")

In [18]:
os.listdir(TEST_ESC50_P)

['test_dataset.py']

In [19]:
os.listdir(META_ESC50_P)

['esc50.csv', 'esc50-human.xlsx']

Let us verify esc50 has the door and glass sounds that we need

In [35]:
# Get audio format extension
audio_ext: str = os.listdir(AUDIO_ESC50_P)[0].split(".")[-1]
audio_ext

'wav'
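Reading the extension from the first listed file assumes every file in the folder shares it. A quick way to check that assumption is to count extensions across the whole folder. This is a sketch with a hypothetical helper, `count_extensions`, run against temporary files rather than the real audio folder.

```python
import os
import tempfile
from collections import Counter


def count_extensions(folder: str) -> Counter:
    """Count file extensions (lower-cased, without the dot) directly under `folder`."""
    counts: Counter = Counter()
    for name in os.listdir(folder):
        if os.path.isfile(os.path.join(folder, name)):
            ext = os.path.splitext(name)[1].lstrip(".").lower()
            counts[ext] += 1
    return counts


# Illustrative files only; the real check would point at AUDIO_ESC50_P.
with tempfile.TemporaryDirectory() as folder:
    for name in ("a.wav", "b.wav", "notes.txt"):
        open(os.path.join(folder, name), "w").close()
    print(count_extensions(folder))
```

If the counter holds a single key, using that key as `audio_ext` is safe.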

In [9]:
# Let's view the esc50.csv features
esc50: pd.DataFrame = pd.read_csv(os.path.join(META_ESC50_P, "esc50.csv"))
esc50.head()

Unnamed: 0,filename,fold,target,category,esc10,src_file,take
0,1-100032-A-0.wav,1,0,dog,True,100032,A
1,1-100038-A-14.wav,1,14,chirping_birds,False,100038,A
2,1-100210-A-36.wav,1,36,vacuum_cleaner,False,100210,A
3,1-100210-B-36.wav,1,36,vacuum_cleaner,False,100210,B
4,1-101296-A-19.wav,1,19,thunderstorm,False,101296,A


In [10]:
# View unique categories
esc50.category.unique()

array(['dog', 'chirping_birds', 'vacuum_cleaner', 'thunderstorm',
       'door_wood_knock', 'can_opening', 'crow', 'clapping', 'fireworks',
       'chainsaw', 'airplane', 'mouse_click', 'pouring_water', 'train',
       'sheep', 'water_drops', 'church_bells', 'clock_alarm',
       'keyboard_typing', 'wind', 'footsteps', 'frog', 'cow',
       'brushing_teeth', 'car_horn', 'crackling_fire', 'helicopter',
       'drinking_sipping', 'rain', 'insects', 'laughing', 'hen', 'engine',
       'breathing', 'crying_baby', 'hand_saw', 'coughing',
       'glass_breaking', 'snoring', 'toilet_flush', 'pig',
       'washing_machine', 'clock_tick', 'sneezing', 'rooster',
       'sea_waves', 'siren', 'cat', 'door_wood_creaks', 'crickets'],
      dtype=object)

In [12]:
# Note the glass_breaking, door_wood_creaks, and door_wood_knock categories
# Count the rows that belong to those categories
sum(esc50.category.isin(["door_wood_knock", "glass_breaking", "door_wood_creaks"]))

120

In [29]:
# Let us hear an example of the audio
_temp_path: str = os.path.join(AUDIO_ESC50_P, esc50[esc50.category == "door_wood_knock"].reset_index(drop=True).loc[0, "filename"])
print(f"The path {_temp_path}")
x, sr = librosa.load(_temp_path)
print(f"This is the sample rate: {sr}")
ipd.Audio(x, rate=sr, autoplay=True)

The path ../data/external/esc50/audio/1-101336-A-30.wav
This is the sample rate: 22050
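The 22050 printed above is the default target rate of `librosa.load`, which resamples on load unless `sr=None` is passed, so it need not match the rate stored in the file. To see the native rate without decoding the audio, the header can be read with the standard-library `wave` module. The sketch below assumes plain PCM WAV files; `native_sample_rate` and `write_test_tone` are hypothetical helpers, and the tone file only stands in for a real clip.

```python
import math
import os
import struct
import tempfile
import wave


def native_sample_rate(path: str) -> int:
    """Read the sample rate stored in a WAV header, without resampling."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def write_test_tone(path: str, rate: int = 44100, seconds: float = 0.1) -> None:
    """Write a short mono 16-bit sine tone so the sketch has a file to inspect."""
    n_frames = int(rate * seconds)
    frames = b"".join(
        struct.pack("<h", int(12000 * math.sin(2 * math.pi * 440 * i / rate)))
        for i in range(n_frames)
    )
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(frames)


with tempfile.TemporaryDirectory() as tmp:
    tone_path = os.path.join(tmp, "tone.wav")
    write_test_tone(tone_path)
    print(native_sample_rate(tone_path))
```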


Inspecting *Door Sound (Rion Hermoso)* Folder

In [4]:
# Create a path to the door sound
door_sounds_rp: str = os.path.join(EXTERN_P, "Door sounds")
len(os.listdir(door_sounds_rp))

1701

### Initialize Raw Data Setup

Next, let's generate a simple structure in the raw data folder. We filter and fetch only the data relevant to this project's requirements. The raw dataset will have door and glass folders, plus a csv file that records the naming convention and the path of each file.

In [22]:
# Paths for the raw door/glass folders and the raw metadata file
DOOR_RAW_P: str = os.path.join(RAW_P, "door")
GLASS_RAW_P: str = os.path.join(RAW_P, "glass")
META_RAW_P: str = os.path.join(RAW_P, "metadata.csv")

In [23]:
if not os.path.exists(DOOR_RAW_P):
    os.mkdir(DOOR_RAW_P)
if not os.path.exists(GLASS_RAW_P):
    os.mkdir(GLASS_RAW_P)
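The check-then-create pattern above works when `../data/raw` already exists. A slightly more forgiving variant is `os.makedirs` with `exist_ok=True`, which also creates missing parent folders and does nothing when the target is already there. This is a sketch with a hypothetical helper, `ensure_dirs`, exercised in a temporary directory that stands in for the data folder.

```python
import os
import tempfile


def ensure_dirs(*paths: str) -> None:
    """Create each directory (and any missing parents); do nothing if it exists."""
    for path in paths:
        os.makedirs(path, exist_ok=True)


with tempfile.TemporaryDirectory() as data_root:
    raw = os.path.join(data_root, "raw")  # stands in for RAW_P
    ensure_dirs(os.path.join(raw, "door"), os.path.join(raw, "glass"))
    ensure_dirs(os.path.join(raw, "door"))  # repeating the call is harmless
    print(sorted(os.listdir(raw)))
```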

It's about time we reformat the csv file to fit the needs of the project

In [40]:
if not os.path.exists(META_RAW_P) or not os.path.isfile(META_RAW_P):
    # Define the appropriate columns
    index: list = ["id", "filename", "filepath", "category"]
    columns: object = pd.Index(index, dtype=object)
    df: pd.DataFrame = pd.DataFrame(columns=columns)
    
    # Save to raw
    df.to_csv(META_RAW_P, index=False)

Let us verify the csv

In [35]:
pd.read_csv(META_RAW_P).head()

Unnamed: 0,id,filename,filepath,category


We're going to make a function that updates the metadata file whenever new files are transferred

In [41]:
def update_metadata_csv() -> pd.DataFrame:
    df: pd.DataFrame = pd.read_csv(META_RAW_P)

    # Drop the Jupyter checkpoint folder from each directory listing with np.setdiff1d
    door_audio: np.ndarray = np.setdiff1d(os.listdir(DOOR_RAW_P), [".ipynb_checkpoints"])
    glass_audio: np.ndarray = np.setdiff1d(os.listdir(GLASS_RAW_P), [".ipynb_checkpoints"])
    #print(f"Current audio, {door_audio}")
    #print(f"Current audio, {glass_audio}")
    
    # Filenames already recorded in the csv, so that no file is added twice
    door_paths_in_csv: np.ndarray = df[df["category"] == "door_sound"]["filename"].to_numpy()
    glass_paths_in_csv: np.ndarray = df[df["category"] == "glass_breaking"]["filename"].to_numpy()
    #print(f"Current csv, {door_paths_in_csv}")
    #print(f"Current csv, {glass_paths_in_csv}")
    
    # Decide on the difference
    door_audio_missing: list = list(np.setdiff1d(door_audio, door_paths_in_csv))
    glass_audio_missing: list = list(np.setdiff1d(glass_audio, glass_paths_in_csv))
    #print(f"Currently missing wav file, {door_audio_missing}")
    #print(f"Currently missing wav file, {glass_audio_missing}")

    for missing in door_audio_missing:
        # Take the id from the filename stem (text after the last "_"; a name like door-1.wav has none, so the id is door-1)
        door_id = missing.split(".")[0].split("_")[-1]
        df.loc[len(df.index)] = [door_id, missing, os.path.join(DOOR_RAW_P, missing), "door_sound"]

    for missing in glass_audio_missing:
        # Take the id from the filename stem (text after the last "_"; a name like glass-1.wav has none, so the id is glass-1)
        glass_id = missing.split(".")[0].split("_")[-1]
        df.loc[len(df.index)] = [glass_id, missing, os.path.join(GLASS_RAW_P, missing), "glass_breaking"]

    df.to_csv(META_RAW_P, index=False)
    return df
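The core of `update_metadata_csv` is a set difference: files on disk, minus files already recorded, minus the checkpoint folder. The same step can be sketched with plain Python sets, which also shows why running the update twice adds nothing the second time. `missing_from_metadata` is a hypothetical helper and the file names are illustrative. Like `np.setdiff1d`, it returns sorted, unique names.

```python
from typing import Iterable


def missing_from_metadata(
    on_disk: Iterable[str],
    recorded: Iterable[str],
    ignore: Iterable[str] = (".ipynb_checkpoints",),
) -> list[str]:
    """Return names present on disk but not yet recorded, sorted and de-duplicated."""
    return sorted(set(on_disk) - set(recorded) - set(ignore))


on_disk = ["door-0.wav", "door-1.wav", ".ipynb_checkpoints"]
recorded: list[str] = []

first_pass = missing_from_metadata(on_disk, recorded)
recorded.extend(first_pass)  # imitate appending the new rows to the csv
second_pass = missing_from_metadata(on_disk, recorded)

print(first_pass)
print(second_pass)
```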

### Data Integration to Raw Dataset

Finally, we copy the audio files into the raw dataset. This keeps the original esc50 intact while we modify and transform the raw version. The first objective is to get all of the audio files that match the categories we need

In [33]:
# Get all category matching the agenda
esc50_prep: pd.DataFrame = esc50[esc50.category.isin(["door_wood_knock", "glass_breaking", "door_wood_creaks"])]
esc50_prep.head()

Unnamed: 0,filename,fold,target,category,esc10,src_file,take
6,1-101336-A-30.wav,1,30,door_wood_knock,False,101336,A
9,1-103995-A-30.wav,1,30,door_wood_knock,False,103995,A
10,1-103999-A-30.wav,1,30,door_wood_knock,False,103999,A
95,1-20133-A-39.wav,1,39,glass_breaking,False,20133,A
139,1-26188-A-30.wav,1,30,door_wood_knock,False,26188,A


In [80]:
count_id_door: int = len(os.listdir(DOOR_RAW_P))
count_id_glass: int = len(os.listdir(GLASS_RAW_P))

for _, row in esc50_prep.iterrows():
    source: str = os.path.join(AUDIO_ESC50_P, row["filename"])
    if row["category"] == "door_wood_knock" or row["category"] == "door_wood_creaks":
        fileout: str = os.path.join(DOOR_RAW_P, f"door-{count_id_door}.{audio_ext}")
        shutil.copy(source, fileout)
        count_id_door += 1
    elif row["category"] == "glass_breaking":
        fileout: str = os.path.join(GLASS_RAW_P, f"glass-{count_id_glass}.{audio_ext}")
        shutil.copy(source, fileout)
        count_id_glass += 1
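Seeding the counters with `len(os.listdir(...))` counts every entry in the folder, including `.ipynb_checkpoints`, so the first index depends on what happens to be there. One alternative is to derive the next index from the names that already follow the `door-N.wav` pattern. The sketch below uses a hypothetical helper, `next_index`, on a temporary folder. It does not address a second point worth noting: re-running the copy cell copies the same clips again under new names.

```python
import os
import re
import tempfile


def next_index(folder: str, prefix: str, ext: str = "wav") -> int:
    """Return one more than the largest N among '<prefix>-N.<ext>' files (0 if none)."""
    pattern = re.compile(rf"^{re.escape(prefix)}-(\d+)\.{re.escape(ext)}$")
    used = [int(m.group(1)) for name in os.listdir(folder) if (m := pattern.match(name))]
    return max(used, default=-1) + 1


with tempfile.TemporaryDirectory() as door_dir:
    os.mkdir(os.path.join(door_dir, ".ipynb_checkpoints"))
    for name in ("door-0.wav", "door-1.wav"):
        open(os.path.join(door_dir, name), "w").close()
    print(len(os.listdir(door_dir)))     # counts the checkpoint folder too
    print(next_index(door_dir, "door"))  # looks only at door-N.wav names
```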

In [43]:
# Update the meta csv
update_metadata_csv()

Unnamed: 0,id,filename,filepath,category
0,door-1,door-1.wav,../data/raw/door/door-1.wav,door_sound
1,door-10,door-10.wav,../data/raw/door/door-10.wav,door_sound
2,door-11,door-11.wav,../data/raw/door/door-11.wav,door_sound
3,door-12,door-12.wav,../data/raw/door/door-12.wav,door_sound
4,door-13,door-13.wav,../data/raw/door/door-13.wav,door_sound
...,...,...,...,...
115,glass-5,glass-5.wav,../data/raw/glass/glass-5.wav,glass_breaking
116,glass-6,glass-6.wav,../data/raw/glass/glass-6.wav,glass_breaking
117,glass-7,glass-7.wav,../data/raw/glass/glass-7.wav,glass_breaking
118,glass-8,glass-8.wav,../data/raw/glass/glass-8.wav,glass_breaking
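The rows above come out in plain string order (door-1, door-10, door-11, ...), because the filenames are sorted as text. If numeric order is preferred for display, a natural sort key is one option. This is a sketch with a hypothetical helper, `natural_key`. It is not something the notebook applies.

```python
import re


def natural_key(name: str) -> list:
    """Split a name into text and integer chunks so 'door-2' sorts before 'door-10'."""
    return [
        int(chunk) if chunk.isdigit() else chunk.lower()
        for chunk in re.split(r"(\d+)", name)
    ]


names = ["door-10.wav", "door-1.wav", "door-2.wav", "glass-3.wav"]
print(sorted(names))                   # plain string order
print(sorted(names, key=natural_key))  # numeric-aware order
```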


### Initialize Interim Setup

We follow the same process used to initialize the raw dataset

In [36]:
# Paths for the interim door/glass folders and the interim metadata file
DOOR_INT_P: str = os.path.join(INTERIM_P, "door")
GLASS_INT_P: str = os.path.join(INTERIM_P, "glass")
META_INT_P: str = os.path.join(INTERIM_P, "metadata.csv")

In [37]:
if not os.path.exists(DOOR_INT_P):
    os.mkdir(DOOR_INT_P)
if not os.path.exists(GLASS_INT_P):
    os.mkdir(GLASS_INT_P)

In [38]:
if not os.path.exists(META_INT_P) or not os.path.isfile(META_INT_P):
    # Define the appropriate columns
    index: list = ["id", "filename", "filepath", "category"]
    columns: object = pd.Index(index, dtype=object)
    df: pd.DataFrame = pd.DataFrame(columns=columns)
    
    # Save to interim
    df.to_csv(META_INT_P, index=False)
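Since the raw and interim setups repeat the same three steps (create two class folders, then write an empty metadata file if none exists), they could share one helper. The sketch below is an assumption-laden illustration: `init_stage` and `METADATA_COLUMNS` are hypothetical names, and it writes the header with the standard-library `csv` module instead of pandas to stay dependency-free.

```python
import csv
import os
import tempfile

METADATA_COLUMNS = ["id", "filename", "filepath", "category"]


def init_stage(stage_dir: str, classes=("door", "glass")) -> str:
    """Create class folders and an empty metadata.csv under `stage_dir`; return the csv path."""
    for name in classes:
        os.makedirs(os.path.join(stage_dir, name), exist_ok=True)
    meta_path = os.path.join(stage_dir, "metadata.csv")
    if not os.path.isfile(meta_path):  # never overwrite an existing metadata file
        with open(meta_path, "w", newline="") as handle:
            csv.writer(handle).writerow(METADATA_COLUMNS)
    return meta_path


with tempfile.TemporaryDirectory() as data_root:
    for stage in ("raw", "interim"):  # stand-ins for RAW_P and INTERIM_P
        print(init_stage(os.path.join(data_root, stage)))
```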