## Calculating Audio Durations and Updating CSV

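This Jupyter notebook computes the duration of each audio file referenced in an annotations CSV and writes the result to a new CSV. The input file, `d00_audio_annotations.csv`, is expected to contain one row per annotation, including a `path` column that gives the audio file's location relative to the dataset folder.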

1. **Read the CSV File**: The code reads the input CSV into a pandas DataFrame. The file must contain a `path` column; the remaining columns hold the annotation attributes.

2. **Sort by Path**: The rows are sorted by `path` so that all annotations of the same audio file are adjacent, which lets each file's duration be computed only once.

3. **Calculate Durations**: The code iterates through the sorted DataFrame and measures the duration of each audio file with the `pydub` library.

4. **Update the DataFrame**: Each row receives its file's duration in `HH:MM:SS` format. When a row has the same path as the previous row, the previously calculated duration is reused instead of loading the file again.

5. **Rearrange Columns**: The columns are reordered so that the new `audio_duration` column comes immediately after the `time` column.

6. **Save to New CSV**: The updated DataFrame is saved as a new CSV file, `d01_audio_annotations.csv`.
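
Steps 2 to 4 amount to measuring each distinct path once and reusing the result for every row that shares it. A minimal sketch of that idea in plain Python follows. `durations_by_path` and the `measure` callback are hypothetical names, not part of the notebook, and a dictionary stands in for the sort-and-compare bookkeeping.

```python
from typing import Callable, Dict, Iterable, List, Optional


def durations_by_path(
    paths: Iterable[str],
    measure: Callable[[str], float],
) -> List[Optional[float]]:
    """Return one duration per input path, calling `measure` once per distinct path.

    `measure` is a hypothetical callback that returns a duration in seconds
    and raises FileNotFoundError when the file is missing.
    """
    cache: Dict[str, Optional[float]] = {}
    result: List[Optional[float]] = []
    for path in paths:
        if path not in cache:
            try:
                cache[path] = measure(path)
            except FileNotFoundError:
                # Remember the failure so the file is not retried for every row
                cache[path] = None
        result.append(cache[path])
    return result


# Usage with a stand-in measurement function that records its calls
calls: List[str] = []


def fake_measure(path: str) -> float:
    calls.append(path)
    if path == "missing.wav":
        raise FileNotFoundError(path)
    return 60.0


rows = ["a.wav", "a.wav", "missing.wav", "b.wav", "a.wav"]
durations = durations_by_path(rows, fake_measure)
print(durations)  # one entry per row
print(calls)      # one entry per distinct path
```

Because the cache is keyed by path, this variant does not depend on the rows being sorted, and a missing file leaves `None` for all of its rows rather than inheriting a neighbour's value.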


In [1]:
import pandas as pd
from pydub import AudioSegment
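
`AudioSegment.from_file` loads the complete audio data into memory before its length can be read. For uncompressed PCM WAV files, the duration can also be derived from the header alone with the standard-library `wave` module. The sketch below is an alternative under that assumption, not the notebook's method. `wav_duration_seconds` is a hypothetical helper, and the demo writes a small silent file so that it runs on its own.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, computed from its header (frames / frame rate)."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


# Demo: write one second of 16-bit mono silence at 8 kHz, then measure it
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)       # 2 bytes per sample = 16-bit
    wav_file.setframerate(8000)
    wav_file.writeframes(b"\x00\x00" * 8000)

demo_duration = wav_duration_seconds(demo_path)
print(demo_duration)
```

The `wave` module raises `wave.Error` for files it cannot parse, so `pydub` remains the more general choice when the recordings are not plain PCM.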

In [2]:
ROOT_PATH = "../../../desarrollo/"

DATASET_FOLDER = ROOT_PATH + "Data/Dataset/Audios/"

# Load the CSV file
input_file = ROOT_PATH + "Data/Annotations/" + "d00_audio_annotations.csv"
df = pd.read_csv(input_file)

# Path to the folder where you want to save the CSV files
output_file = ROOT_PATH + "Data/Annotations/" + "d01_audio_annotations.csv"

In [6]:
# Read the CSV file
df = pd.read_csv(input_file)

In [9]:
# Sort the DataFrame by 'path'
df = df.sort_values(by='path')

# Print the paths that are listed in the CSV but cannot be found in the folder.
# Each distinct path is checked once, since many rows share the same file.
for file_path in df["path"].unique():
    file_path = DATASET_FOLDER + file_path
    try:
        audio = AudioSegment.from_file(file_path)
    except FileNotFoundError:
        print(file_path)
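
If the goal is only to find paths that are listed in the CSV but absent from disk, an existence check avoids decoding any audio. A minimal sketch, assuming the paths in the CSV are relative to a single root folder as in the cell above. `find_missing` is a hypothetical helper, and the temporary folder stands in for `DATASET_FOLDER`.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def find_missing(relative_paths: Iterable[str], root: str) -> List[str]:
    """Return the distinct relative paths with no regular file under `root`, sorted."""
    return [p for p in sorted(set(relative_paths)) if not (Path(root) / p).is_file()]


# Demo with a temporary folder that contains only one of the two files
root = tempfile.mkdtemp()
(Path(root) / "AM1").mkdir()
(Path(root) / "AM1" / "present.WAV").touch()

missing = find_missing(["AM1/present.WAV", "AM1/absent.WAV", "AM1/absent.WAV"], root)
print(missing)
```

Unlike loading each file with `AudioSegment.from_file`, this check does not confirm that a file can actually be decoded; it only confirms that it exists.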

In [8]:
# Delete rows with files that are not in the folder
# df = df.drop(df[df['path'] == 'AM1/2023_05_10/AM1_20230510_110000.WAV'].index)

In [10]:
# Sort the DataFrame by 'path'
df = df.sort_values(by='path')

# Create a new column to store the audio duration in HH:MM:SS format
df['audio_duration'] = ""

# Initialize variables to track the previous path and its audio duration
prev_path = ""
prev_duration = None

# Iterate through the sorted DataFrame and calculate the duration of each audio
for index, row in df.iterrows():
    path = row['path']
    if path != prev_path:
        try:
            # This is a new audio file, calculate its duration
            audio = AudioSegment.from_file(DATASET_FOLDER + path, format="wav")
            duration_seconds = len(audio) / 1000  # Convert to seconds
            duration_time = pd.to_datetime(duration_seconds, unit='s').strftime('%H:%M:%S')
            df.at[index, 'audio_duration'] = duration_time
            prev_duration = duration_time
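        except FileNotFoundError:
            # Report the missing file and clear the cached duration so that
            # later rows with this path do not inherit another file's value
            print(DATASET_FOLDER + path)
            prev_duration = ""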
    else:
        # This audio file has the same path as the previous one, use the previous duration
        df.at[index, 'audio_duration'] = prev_duration

    prev_path = path

# Rearrange the columns to place 'audio_duration' after 'time'
columns = ['path', 'annotator', 'recorder', 'date', 'time', 'audio_duration', 'start_time', 'end_time', 'low_frequency', 'high_frequency', 'specie']
#columns = ['path', 'recorder', 'date', 'time', 'audio_duration', 'start_time', 'end_time', 'specie']
df = df[columns]

# Save the updated DataFrame to a new CSV file
df.to_csv(output_file, index=False)
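
The `pd.to_datetime(...).strftime('%H:%M:%S')` call in the loop formats a clock time, so a duration of 24 hours or more would wrap around past `00:00:00`. That is unlikely to matter for short field recordings, but a formatter that does not wrap can be sketched with integer arithmetic. `format_hms` is a hypothetical helper that, like the original call, drops fractional seconds.

```python
def format_hms(seconds: float) -> str:
    """Format a non-negative duration in seconds as HH:MM:SS without wrapping at 24 h."""
    total = int(seconds)  # truncate fractional seconds
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


print(format_hms(60))        # 00:01:00
print(format_hms(3661.9))    # 01:01:01
print(format_hms(90000))     # 25:00:00
```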

In [11]:
df

Unnamed: 0,path,annotator,recorder,date,time,audio_duration,start_time,end_time,low_frequency,high_frequency,specie
1170,AM1/2023_05_10/AM1_20230510_060000.WAV,Edu,AM1,2023/05/10,06:00:00,00:01:00,12.488571,13.054286,2956.591309,5252.657715,unknown
1175,AM1/2023_05_10/AM1_20230510_060000.WAV,Edu,AM1,2023/05/10,06:00:00,00:01:00,30.548571,32.537143,2820.347168,5601.550781,galerida theklae
1173,AM1/2023_05_10/AM1_20230510_060000.WAV,Edu,AM1,2023/05/10,06:00:00,00:01:00,20.468571,21.694286,1914.604370,4747.935059,cyanopica cooki
1172,AM1/2023_05_10/AM1_20230510_060000.WAV,Edu,AM1,2023/05/10,06:00:00,00:01:00,17.537143,18.848571,3050.338623,5177.788574,galerida theklae
1171,AM1/2023_05_10/AM1_20230510_060000.WAV,Edu,AM1,2023/05/10,06:00:00,00:01:00,13.054286,13.431429,1964.719482,6681.761230,cyanopica cooki
...,...,...,...,...,...,...,...,...,...,...,...
1185,AM8/2023_03_04/AM8_20230304_110000.WAV,Giulia,AM8,2023/03/04,11:00:00,00:01:00,12.077519,21.122020,421.355051,647.880879,streptotelia decaocto
1186,AM8/2023_03_04/AM8_20230304_110000.WAV,Giulia,AM8,2023/03/04,11:00:00,00:01:00,13.564934,18.850714,1136.428688,14549.684570,sturnus un/vulg?
1188,AM8/2023_03_04/AM8_20230304_110000.WAV,Giulia,AM8,2023/03/04,11:00:00,00:01:00,17.846206,27.115320,243.147827,16000.000000,ciconia ciconia
1189,AM8/2023_03_04/AM8_20230304_110000.WAV,Giulia,AM8,2023/03/04,11:00:00,00:01:00,22.380322,28.795481,1996.703886,16000.000000,sturnus un/vulg
