This notebook extracts the unique species names from a CSV file of audio annotations. Each annotation records the audio file path, recorder, date, time, audio duration, start and end times, low and high frequencies, species name, and bounding box coordinates.

The notebook performs the following tasks:

1. **Data Loading**: It loads the original CSV file containing audio annotations, which includes species names.

2. **Unique Species Extraction**: It extracts unique species names from the loaded CSV file.

3. **Data Transformation**: The unique species names are then stored in a new DataFrame.

4. **Data Saving**: Finally, the DataFrame containing the unique species names is saved to a new CSV file for further analysis or use in other projects.

Later cells map the raw species names to a curated list, drop non-bird classes, merge rare species into a generic "bird" class, and list the annotations whose frequency bounds are missing (stored as -1).
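The four steps listed above can be tried on a small in-memory example before touching the real files. This is only a sketch: the toy rows and the names `toy_csv`, `toy_df`, `toy_species` and `toy_species_df` are made up for illustration, and dropping missing values with `dropna()` is an assumption that the cells below do not make.

```python
import io
import pandas as pd

# Toy stand-in for the annotations CSV; only the 'specie' column matters here.
toy_csv = io.StringIO(
    "path,specie\n"
    "a.WAV,pica pica\n"
    "b.WAV,parus major\n"
    "c.WAV,pica pica\n"
    "d.WAV,\n"
)
toy_df = pd.read_csv(toy_csv)

# Unique names, with missing values dropped, sorted alphabetically.
toy_species = sorted(toy_df["specie"].dropna().unique())
toy_species_df = pd.DataFrame(toy_species, columns=["Species"])

# to_csv() with no path returns the CSV text instead of writing a file.
print(toy_species_df.to_csv(index=False))
```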

In [1]:
import pandas as pd

In [2]:
# Define ROOT_PATH
ROOT_PATH = "../../../desarrollo/"

# Path to the original CSV of audio annotations
csv_file = ROOT_PATH + "Data/Annotations/audio_annotations.csv"

# Read the original CSV
df = pd.read_csv(csv_file)

In [3]:
# Get the unique species names
unique_species = df['specie'].unique()

# Create a DataFrame with the unique species names
species_df = pd.DataFrame(unique_species, columns=['Species'])

# Order alphabetically
species_df = species_df.sort_values(by='Species')

# Path to the CSV to save the unique species names
output_csv = ROOT_PATH + "Data/Annotations/unique_species.csv"

# Save the unique species names to a new CSV
species_df.to_csv(output_csv, index=False)

In [4]:
# Function to perform species mapping
def map_species(input_file, output_file, species_mapping_file):
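    # Read the species mapping table (columns 'Species' and 'Specie_Name').
    # pd.read_csv never returns None; it raises on failure, so catch the error instead.
    try:
        species_mapping_df = pd.read_csv(species_mapping_file)
    except (FileNotFoundError, pd.errors.EmptyDataError) as err:
        print("Error getting species mapping:", err)
        return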

    # Read the input file
    input_df = pd.read_csv(input_file)

    # Map the species using the mapping DataFrame; names missing from the mapping become NaN
    mapping = dict(zip(species_mapping_df['Species'], species_mapping_df['Specie_Name']))
    input_df['specie'] = input_df['specie'].map(mapping)

    # Report the number of annotations
    print("Number of annotations: ", len(input_df))

    # Save the output file
    input_df.to_csv(output_file, index=False)

# Usage of the map_species function
input_file = ROOT_PATH + "Data/Annotations/audio_annotations.csv"
output_file = ROOT_PATH + "Data/Annotations/o01_audio_annotations_species.csv"
species_mapping_file = ROOT_PATH + "Data/Annotations/unique_species_mapped.csv"

map_species(input_file, output_file, species_mapping_file)

Number of annotations:  3700
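Because `Series.map` with a dictionary returns NaN for any value that has no key, a quick check for unmapped labels can be useful after this step. The sketch below uses a hypothetical mapping table and label list (`mapping_df`, `labels`); only the column names `Species` and `Specie_Name` come from the cell above.

```python
import pandas as pd

# Hypothetical mapping table with the same column names used above.
mapping_df = pd.DataFrame({
    "Species": ["Pica pica", "P. major"],
    "Specie_Name": ["pica pica", "parus major"],
})
mapping = dict(zip(mapping_df["Species"], mapping_df["Specie_Name"]))

# Hypothetical raw labels; "Turdus sp." has no entry in the mapping.
labels = pd.Series(["Pica pica", "P. major", "Turdus sp."], name="specie")
mapped = labels.map(mapping)

# Labels with no entry in the mapping come back as NaN.
unmapped = sorted(labels[mapped.isna()].unique())
print("Unmapped labels:", unmapped)
```

If the list is not empty, the mapping file is probably missing those names.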


In [5]:
input_file = ROOT_PATH + "Data/Annotations/o01_audio_annotations_species.csv"
output_file = ROOT_PATH + "Data/Annotations/o02_audio_annotations.csv"

# Read the input file
input_df = pd.read_csv(input_file)

# Delete rows with specie "quiroptera" (1), "abiotic noise" and "insect"; "unknown" rows are kept and relabeled as "bird" below
input_df = input_df[input_df['specie'] != "quiroptera"]
# input_df = input_df[input_df['specie'] != "unknown"] # Unknown = Bird
input_df = input_df[input_df['specie'] != "abiotic noise"]
input_df = input_df[input_df['specie'] != "insect"]

# Get the species counts
species_counts = input_df['specie'].value_counts()

# Get the species with fewer than X samples
X = 10
rare_species = species_counts[species_counts < X].index

# Map the species with fewer than X samples to "bird"
input_df['specie'] = input_df['specie'].apply(lambda x: "bird" if x in rare_species else x)

# "unknown" is also treated as "bird"
input_df['specie'] = input_df['specie'].apply(lambda x: "bird" if x == "unknown" else x)

# Save the output file
input_df.to_csv(output_file, index=False)

# Also save the CSV with name dataset.csv
output_file = ROOT_PATH + "Data/CSVs/dataset.csv"
input_df.to_csv(output_file, index=False)

In [6]:
# Number of rows of df
print("Number of annotations: ", len(input_df))

Number of annotations:  3449
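The merging of rare species and "unknown" into "bird" can also be written without `apply`, using a boolean mask. This is a sketch on made-up data, not the notebook's own code: `toy`, `min_samples`, `rare` and `to_bird` are hypothetical names, and the threshold of 2 only keeps the example small (the cell above uses `X = 10`).

```python
import pandas as pd

toy = pd.DataFrame({"specie": ["passer"] * 3 + ["tringa"] + ["unknown"] * 2})

min_samples = 2  # hypothetical threshold for this toy example
counts = toy["specie"].value_counts()
rare = counts[counts < min_samples].index

# Replace rare labels and "unknown" with the generic "bird" class.
to_bird = toy["specie"].isin(rare) | (toy["specie"] == "unknown")
toy.loc[to_bird, "specie"] = "bird"

print(toy["specie"].value_counts())
```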


In [3]:
output_file = ROOT_PATH + "Data/Annotations/o02_audio_annotations.csv"

In [4]:
# Read the output file and print the unique species in alphabetical order, each preceded by its index (starting at 0) and a colon
df = pd.read_csv(output_file)
df['specie'] = df['specie'].astype(str)
# Sort unique species alphabetically
unique_species_sorted = sorted(df['specie'].unique())

# Exclude abiotic noise, insect and nan
unique_species_sorted = [x for x in unique_species_sorted if x not in ['abiotic noise', 'nan', 'insect']]

# Put "bird" first
unique_species_sorted.remove('bird')
unique_species_sorted.insert(0, 'bird')

for i, specie in enumerate(unique_species_sorted):
    print(f"{i}: {specie}")

0: bird
1: alaudidae
2: calandrella brachydactyla
3: carduelis carduelis
4: cettia cetti
5: chloris chloris
6: ciconia ciconia
7: cisticola juncidis
8: coturnix coturnix
9: curruca
10: emberiza calandra
11: fringilla
12: galerida
13: lanius
14: luscinia megarhynchos
15: melanocorypha calandra
16: merops apiaster
17: milvus migrans
18: motacilla flava
19: parus major
20: passer
21: pica pica
22: saxicola rubicola
23: streptopelia
24: sturnus
25: tringa
26: turdus
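If the enumeration above is meant to serve as class indices (an assumption; the notebook only prints it), the same ordering can be kept in a dictionary for later lookups. The sketch uses a toy subset of the list, and `species`, `ordered` and `class_to_id` are hypothetical names.

```python
species = ["turdus", "bird", "passer", "pica pica"]  # toy subset of the list above

# Alphabetical order, with the generic "bird" class moved to index 0.
ordered = ["bird"] + sorted(s for s in species if s != "bird")
class_to_id = {name: i for i, name in enumerate(ordered)}
print(class_to_id)
```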


In [4]:
output_file = ROOT_PATH + "Data/Annotations/o02_audio_annotations.csv"

annotations_df = pd.read_csv(output_file)

# Print number of instances
print("Count of instances: ", len(annotations_df))

# Print count of instances with low_frequency == -1
print("Count of instances with low_frequency == -1: ", len(annotations_df[annotations_df.low_frequency == -1]))

# high_frequency == -1
print("Count of instances with high_frequency == -1: ", len(annotations_df[annotations_df.high_frequency == -1]))

# Total number of instances with frequencies == -1
print("Total number of instances with frequencies == -1: ", len(annotations_df[(annotations_df.low_frequency == -1) | (annotations_df.high_frequency == -1)]))

print("\n\n\n\n")
# Print a row with low_frequency == -1
print("Low frequency == -1\n", annotations_df[annotations_df.low_frequency == -1].iloc[0])
print("\n\n")
# Print a row with high_frequency == -1
print("High frequency == -1\n", annotations_df[annotations_df.high_frequency == -1].iloc[0])

Count of instances:  3449
Count of instances with low_frequency == -1:  1
Count of instances with high_frequency == -1:  183
Total number of instances with frequencies == -1:  184





Low frequency == -1
 path                         AM1/2023_05_15/AM1_20230515_070000.WAV
recorder                                                        AM1
date                                                     2023/05/15
time                                                       07:00:00
audio_duration                                             00:01:00
start_time                                                28.225158
end_time                                                  29.801838
low_frequency                                                  -1.0
high_frequency                                          2844.192383
specie                                                         bird
bbox              [0.4827956989247312, 0.5881351138766071, 0.025...
Name: 73, dtype: object



High frequency == -

In [6]:
# Save df["path"] in a txt file for those rows with low_frequency == -1 or high_frequency == -1

# Get the paths of the rows with low_frequency == -1 or high_frequency == -1
paths = annotations_df[(annotations_df.low_frequency == -1) | (annotations_df.high_frequency == -1)].path

# Keep only unique paths
paths = paths.drop_duplicates()

# Save the paths to a txt file (the first line written is the "path" column header)
output_file = "missing_frequencies_paths.txt"
paths.to_csv(output_file, index=False)
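When the text file should contain one path per line and nothing else, passing `header=False` to `to_csv` leaves out the column name. The sketch below shows this on made-up rows written to a temporary directory; `toy`, `missing`, `missing_paths` and `out_path` are hypothetical names, and whether the header is wanted depends on how the file is read later.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({
    "path": ["a.WAV", "a.WAV", "b.WAV", "c.WAV"],
    "low_frequency": [-1.0, 100.0, 200.0, 300.0],
    "high_frequency": [900.0, -1.0, -1.0, 800.0],
})

# Rows where either frequency bound is the -1 sentinel.
missing = (toy["low_frequency"] == -1) | (toy["high_frequency"] == -1)
missing_paths = toy.loc[missing, "path"].drop_duplicates()

# header=False keeps the file to one path per line, with no "path" header row.
out_path = Path(tempfile.mkdtemp()) / "missing_frequencies_paths.txt"
missing_paths.to_csv(out_path, index=False, header=False)

written = out_path.read_text().splitlines()
print(written)
```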