# Preamble

This notebook should be used when your dataset has not been built because your audio files failed at least one of the tests required for upload to the OSmOSE platform. It also lets you perform (irreversible) file deletions so that the remaining files meet the upload criteria.

Define the name of the dataset and the name of the folder containing the audio files (by default, 'original').

In [None]:
dataset = 'boussole_MERMAID_v2'
audio_folder_name = 'original'

Load the metadata CSV file.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os

path_audio = '/home/datawork-osmose/dataset/'+dataset+'/data/audio/'+audio_folder_name+'/'
path_file_metadata = path_audio+'file_metadata.csv'
file_metadata = pd.read_csv(path_file_metadata)
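
The cells of this notebook read four columns of the metadata table: `filename`, `status_read_header`, `origin_sr` and `duration`. The sketch below is one possible way to check that they are all present before going further. The helper `missing_columns` and the toy table are hypothetical and only serve as an illustration; in practice you would pass `file_metadata` instead.

```python
import pandas as pd

# Columns read by the cells of this notebook
EXPECTED_COLUMNS = ['filename', 'status_read_header', 'origin_sr', 'duration']

def missing_columns(metadata, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `metadata`."""
    return [col for col in expected if col not in metadata.columns]

# Toy table standing in for file_metadata (values are made up for illustration)
toy_metadata = pd.DataFrame({
    'filename': ['a.wav', 'b.wav'],
    'origin_sr': [48000, 48000],
    'duration': [60.0, 12.5],
})
print('Missing columns :', missing_columns(toy_metadata))
```

An empty list means that every cell below should find the columns it needs.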

# Exploring / describing metadata

In [None]:
file_metadata.head()

In [None]:
file_metadata.describe()

## Reading header

In [None]:
print('Number of file headers that cannot be read :',sum(file_metadata['status_read_header'].values == False))

## Sampling rate

In [None]:
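# Count how many files share each sampling rate
df_da = file_metadata['origin_sr'].value_counts().reset_index()
df_da.columns = ['Sampling rate (Hz)', 'Counts']
# Histogram over all files, so that each bar height is a number of files
file_metadata['origin_sr'].hist()
df_da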

## Duration

In [None]:
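# Count how many files share each duration
df_da = file_metadata['duration'].value_counts().reset_index()
df_da.columns = ['Duration (s)', 'Counts']
# Histogram over all files, so that each bar height is a number of files
file_metadata['duration'].hist()
df_da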

# Delete audio files based on metadata criteria

The cells below let you delete audio files that do not meet certain criteria. The names of these files are stored in the list `list_files_to_be_deleted`, and the last cell below performs the deletion directly from this notebook. Be careful: this operation is irreversible!

Each deletion automatically generates a text file in your current working directory listing the deleted files, so you can keep track of the operation.

Note that the criteria are exclusive: each criterion cell overwrites `list_files_to_be_deleted`, so they should be used independently of each other.

Also note that the file `'file_metadata.csv'` and the pandas variable `file_metadata` are updated directly.
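
Because the deletion is irreversible, it can help to preview the selection before running it. The sketch below is one possible way to split a list of file names into those actually present in a folder and those that are missing. The helper `preview_deletion` is hypothetical, and a temporary folder with made-up file names stands in for `path_audio` and `list_files_to_be_deleted`.

```python
import os
import tempfile

def preview_deletion(folder, file_names):
    """Split `file_names` into those present in `folder` and those missing."""
    present = [f for f in file_names if os.path.isfile(os.path.join(folder, f))]
    missing = [f for f in file_names if f not in present]
    return present, missing

# Temporary folder standing in for path_audio; file names are made up
with tempfile.TemporaryDirectory() as folder:
    open(os.path.join(folder, 'short.wav'), 'w').close()
    present, missing = preview_deletion(folder, ['short.wav', 'ghost.wav'])
    print('Would be deleted :', present)
    print('Not found on disk :', missing)
```

Nothing is removed by this preview; it only reports what the deletion cell would find on disk.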

## Criterion 1: files with non-readable headers

In [None]:
list_files_to_be_deleted = list(file_metadata[file_metadata['status_read_header'].values == False]['filename'])
criterion = 1
print(f'Number of files to be removed : {len(list_files_to_be_deleted)}')

## Criterion 2: files with a duration below a given value

Change the value of `duration_value` (in seconds) below.

In [None]:
duration_value = 60

list_files_to_be_deleted = list(file_metadata[file_metadata['duration'].values < duration_value]['filename'])
criterion = 2
print(f'Number of files to be removed : {len(list_files_to_be_deleted)}')
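
To help choose `duration_value`, one option is to count how many files fall strictly below several candidate thresholds before committing to one. This is a minimal sketch on a toy table; the file names, durations and `candidate_values` are made up, and in practice `file_metadata['duration']` would be used instead.

```python
import pandas as pd

# Toy durations in seconds standing in for file_metadata['duration']
toy_metadata = pd.DataFrame({
    'filename': ['a.wav', 'b.wav', 'c.wav', 'd.wav'],
    'duration': [5.0, 45.0, 60.0, 600.0],
})

# Hypothetical candidate thresholds, in seconds
candidate_values = [10, 60, 120]

# Number of files strictly shorter than each candidate threshold
counts = {
    value: int((toy_metadata['duration'] < value).sum())
    for value in candidate_values
}
for value, n_files in counts.items():
    print(f'duration_value = {value} s : {n_files} file(s) would be selected')
```

The comparison is strict, as in the criterion cell above, so a file lasting exactly `duration_value` seconds is kept.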

## Deletion code

Use with care! The safeguard variable `DO_operation` must be set to `True` for this code to delete anything.

In [None]:
DO_operation = False

deleted_files = []  # names of the files actually removed from disk

if DO_operation:
    # Choose the log header and file name before deleting anything
    if criterion==2:
        textp = f"Following files were removed based on the duration criterion with a value of {duration_value} (in seconds) : \n\n"
        fn = 'deleted_files_criterion_duration.txt'
    elif criterion==1:
        textp = "Following files were removed based on the non-readable header criterion : \n\n"
        fn = 'deleted_files_criterion_nonreadable_header.txt'
    else:
        raise ValueError(f'Unknown criterion : {criterion}')

    for file_name in list_files_to_be_deleted:
        try:
            os.remove(path_audio + file_name)
        except OSError:
            print(f'File {file_name} could not be removed')
            continue
        # Drop the corresponding row from the metadata table
        file_metadata.drop(file_metadata.loc[file_metadata['filename']==file_name].index, inplace=True)
        deleted_files.append(file_name)
        print(f'File {file_name} removed')

    # Save the updated table so that 'file_metadata.csv' matches the remaining files
    file_metadata.to_csv(path_file_metadata, index=False)

    # Log only the files that were actually deleted, in the current working directory
    with open(fn, 'w') as f:
        f.write(textp)
        for line in deleted_files:
            f.write(f"{line}\n")