This Jupyter Notebook extracts the unique species names from an existing CSV file of audio annotations. Each annotation records the path to the audio file, recording details, duration, time and frequency ranges, the species name, and bounding box coordinates.

The notebook performs the following tasks:

1. **Data Loading**: It loads the original CSV file containing audio annotations, which includes species names.

2. **Unique Species Extraction**: It extracts unique species names from the loaded CSV file.

3. **Data Transformation**: The unique species names are then stored in a new DataFrame and sorted alphabetically.

4. **Data Saving**: Finally, the DataFrame containing the unique species names is saved to a new CSV file for further analysis or use in other projects.

Running this notebook produces an organized list of the unique species names in the audio annotations dataset. The later cells use a mapped version of that list (`unique_species_mapped.csv`) to standardize the species labels and to write the filtered dataset CSV.

In [1]:
import pandas as pd

In [2]:
# Define the ROOT_PATH
ROOT_PATH = "../../../desarrollo/"

# Path to the original CSV of audio annotations
csv_file = ROOT_PATH + "Data/Annotations/audio_annotations.csv"

# Read the original CSV
df = pd.read_csv(csv_file)

In [7]:
# Get the unique species names
unique_species = df['specie'].unique()

# Create a DataFrame with the unique species names
species_df = pd.DataFrame(unique_species, columns=['Species'])

# Order alphabetically
species_df = species_df.sort_values(by='Species')

# Path to the CSV to save the unique species names
output_csv = "../Data/Annotations/unique_species.csv"

# Save the unique species names to a new CSV
species_df.to_csv(output_csv, index=False)

In [11]:
! cp ../Data/Annotations/unique_species_mapped.csv ../../../desarrollo/Data/Annotations/unique_species_mapped.csv 

In [12]:
# Function to perform species mapping
def map_species(input_file, output_file, species_mapping_file):
    # Read the species mapping; pd.read_csv raises on failure, it never returns None
    try:
        species_mapping_df = pd.read_csv(species_mapping_file)
    except (FileNotFoundError, pd.errors.EmptyDataError) as err:
        print(f"Error getting species mapping: {err}")
        return

    # Read the input file
    input_df = pd.read_csv(input_file)

    # Map each raw label to its standardized name; labels missing from the mapping become NaN
    species_map = dict(zip(species_mapping_df['Species'], species_mapping_df['Specie_Name']))
    input_df['specie'] = input_df['specie'].map(species_map)

    # Number of rows of df
    print("Number of annotations: ", len(input_df))

    # Save the output file
    input_df.to_csv(output_file, index=False)

# Usage of the map_species function
input_file = ROOT_PATH + "Data/Annotations/audio_annotations.csv"
output_file = ROOT_PATH + "Data/Annotations/audio_annotations_standarized.csv"
species_mapping_file = ROOT_PATH + "Data/Annotations/unique_species_mapped.csv"

map_species(input_file, output_file, species_mapping_file)

Number of annotations:  3171
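
Because `Series.map` with a dictionary returns NaN for any key it does not find, labels that are missing from the mapping file silently lose their species name. The sketch below shows one way to list such labels before saving. The toy DataFrames and their values are hypothetical stand-ins for the real CSVs; only the column names (`specie`, `Species`, `Specie_Name`) come from the cell above.

```python
import pandas as pd

# Toy data standing in for the annotation and mapping CSVs (hypothetical values)
annotations = pd.DataFrame({"specie": ["Parus major", "parus mayor", "Pica pica"]})
mapping = pd.DataFrame({
    "Species": ["Parus major", "parus mayor"],
    "Specie_Name": ["parus major", "parus major"],
})

# Same lookup construction as in map_species
lookup = dict(zip(mapping["Species"], mapping["Specie_Name"]))
mapped = annotations["specie"].map(lookup)

# Labels absent from the mapping come back as NaN; collect their original names
unmapped = sorted(annotations.loc[mapped.isna(), "specie"].unique())
print("Unmapped labels:", unmapped)
```

If the list is not empty, the mapping file probably needs extra rows before the standardized CSV is used downstream.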


In [14]:
input_file = ROOT_PATH + "Data/Annotations/audio_annotations_standarized.csv"
output_file = ROOT_PATH + "Data/Annotations/audio_annotations_standarized.csv"

# Read the input file
input_df = pd.read_csv(input_file)

# Delete rows labelled "quiroptera" (1), "unknown" (unknown = bird), "abiotic noise", or "insect"
excluded = ["quiroptera", "unknown", "abiotic noise", "insect"]
input_df = input_df[~input_df['specie'].isin(excluded)]

# Get the species counts
species_counts = input_df['specie'].value_counts()

# Get the species with fewer than X samples
X = 10
rare_species = species_counts[species_counts < X].index

# Map the species with fewer than X samples to "bird"
input_df['specie'] = input_df['specie'].apply(lambda x: "bird" if x in rare_species else x)

# unknown is bird
# input_df['specie'] = input_df['specie'].apply(lambda x: "bird" if x == "unknown" else x)

# Save the output file
input_df.to_csv(output_file, index=False)

# Also save a copy of the CSV as dataset.csv
output_file = ROOT_PATH + "Data/Dataset/CSVs/dataset.csv"
input_df.to_csv(output_file, index=False)

In [15]:
# Number of rows of df
print("Number of annotations: ", len(input_df))

Number of annotations:  1800
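
The rare-species step above can also be written as a small reusable function. This is a minimal sketch on toy data, not the notebook's own code: the function name `fold_rare_labels` and its parameters are hypothetical, and it uses `isin` with `where` instead of the `apply` call in the cell above.

```python
import pandas as pd

def fold_rare_labels(labels: pd.Series, min_count: int, fallback: str = "bird") -> pd.Series:
    """Replace every label seen fewer than min_count times with fallback."""
    counts = labels.value_counts()
    rare = counts[counts < min_count].index
    # Keep labels that are not rare; replace the rest with the fallback label
    return labels.where(~labels.isin(rare), fallback)

# Toy labels (hypothetical counts): only "lanius" falls below the threshold
labels = pd.Series(["pica pica"] * 3 + ["parus major"] * 2 + ["lanius"])
folded = fold_rare_labels(labels, min_count=2)
print(folded.value_counts().to_dict())
```

Passing the threshold as an argument keeps the cutoff (`X = 10` in the notebook) in one place if it needs tuning later.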


In [18]:
output_file = ROOT_PATH + "Data/Dataset/CSVs/dataset.csv"

In [19]:
# Read the output file and print its unique species in alphabetical order, each preceded by a 0-based index and a colon
df = pd.read_csv(output_file)
df['specie'] = df['specie'].astype(str)
# Sort unique species alphabetically
unique_species_sorted = sorted(df['specie'].unique())

# Remove abiotic noise, insect, unknown and nan
unique_species_sorted = [x for x in unique_species_sorted if x not in ['abiotic noise', 'nan', 'insect', 'unknown']]

# Put "bird" first
unique_species_sorted.remove('bird')
unique_species_sorted.insert(0, 'bird')

for i, specie in enumerate(unique_species_sorted):
    print(f"{i}: {specie}")

0: bird
1: actitis hypoleucos
2: anthus pratensis
3: calandrella brachydactyla
4: carduelis carduelis
5: cettia cetti
6: chloris chloris
7: ciconia ciconia
8: cisticola juncidis
9: curruca
10: emberiza calandra
11: erithacus rubecula
12: fringilla
13: galerida
14: lanius
15: luscinia megarhynchos
16: merops apiaster
17: motacilla
18: parus major
19: passer
20: pica pica
21: saxicola rubicola
22: serinus serinus
23: streptopelia decaocto
24: sturnus
25: turdus merula
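
If the enumeration above is meant to serve as class ids (an assumption; the notebook only prints it), the same ordering can be kept as a dictionary for later lookups. The helper name `build_class_index` is hypothetical. Unlike the cell above, this sketch does not fail when the `bird` label is absent, since it only moves it when present.

```python
def build_class_index(labels, first="bird"):
    """Map each unique label to an index, alphabetically, with `first` at 0 if present."""
    unique = sorted(set(labels))
    if first in unique:
        # list.remove would raise ValueError on a missing label, so check first
        unique.remove(first)
        unique.insert(0, first)
    return {name: i for i, name in enumerate(unique)}

# Toy labels (hypothetical), including a duplicate
class_index = build_class_index(["pica pica", "bird", "lanius", "pica pica"])
print(class_index)
```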
