# 🐦 Audio 101. 2- Detailed EDA

## [BirdCLEF 2022](https://www.kaggle.com/c/birdclef-2022)
### Identify bird calls in soundscapes
![](https://storage.googleapis.com/kaggle-competitions/kaggle/33246/logos/header.png)


## Hi and welcome! This is the second kernel of the series `Audio 101`, the documentation of my learning process in the amazing world of audio processing.

**In this short kernel we will perform a detailed EDA of the input data of the [BirdCLEF 2022](https://www.kaggle.com/c/birdclef-2022/) competition, fetching extra information from the `ebird.org` site with `requests` and creating a concise "bird card" for each species.**


This series aims to get a good understanding of the specific topic from zero.

The ideal reader is a Data Scientist noob with some general knowledge about Deep Learning, but no technical expertise in Audio Processing. 

---

The full series consists of the following notebooks:
1. [🐦 Audio 101. 1-Audio manipulation & musical notes](https://www.kaggle.com/julian3833/audio-101-1-audio-manipulation-musical-notes/)
2. _[🐦 Audio 101. 2- Detailed EDA](https://www.kaggle.com/julian3833/audio-101-2-detailed-eda/) (This notebook)_



This is an ongoing project, so expect more notebooks to be added to the series soon. Actually, we are currently working on the following ones:
* **Plot Fourier Transforms and Spectrograms**
* **Build a simple CNN classifier model over image features**
* **Study the previous competition [BirdCLEF 2021 - Birdcall Identification](https://www.kaggle.com/c/birdclef-2021) and migrate some good models**

---



#  Please _DO_ upvote if you found this useful or interesting!


Enough chitchat, let's code!

# Imports

In [None]:
import os
import json
import requests
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

from bs4 import BeautifulSoup

import torchaudio
from IPython.display import Audio, display, HTML, Image

# Explore the input data

In [None]:
BASE_PATH = "../input/birdclef-2022/"
os.listdir(BASE_PATH)

# train_metadata.csv

From the [Data tab](https://www.kaggle.com/c/birdclef-2022/data):
> A wide range of metadata is provided for the training data. The most directly relevant fields are:
> * **primary_label** - a code for the bird species. You can review detailed information about the bird codes by appending the code to https://ebird.org/species/, such as https://ebird.org/species/amecro for the American Crow.
> * **secondary_labels** - Background species as annotated by the recordist. An empty list does not mean that no background birds are audible.
> * **author** - the eBird user who provided the recording.
> * **filename** - the associated audio file.
> * **rating** - Float value between 0.0 and 5.0 as an indicator of the quality rating on Xeno-canto and the number of background species, where 5.0 is the highest and 1.0 is the lowest. 0.0 means that this recording has no user rating yet.

In [None]:
df_train = pd.read_csv(f"{BASE_PATH}train_metadata.csv")
df_train.shape

In [None]:
df_train.head()

Let's drop a few columns to focus on the core of the problem:

In [None]:
df_train = df_train[['primary_label', 'time', 'filename']]
df_train.head()

Is time an hour of the day or a pointer to a position in the audio?

In [None]:
df_train['time'].value_counts().head(20)

In [None]:
# Each filename is unique
df_train['filename'].nunique() == len(df_train)

It is a time of the day. Each `filename` is unique. Let's drop `time` as well:

In [None]:
df_train = df_train[['primary_label', 'filename']]
df_train.head()
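
To double-check the time-of-day reading, one option is to parse the values and count recordings per hour. The snippet below is a minimal sketch on made-up values: it assumes the column holds `HH:MM` strings, possibly mixed with unparseable entries such as `?`, which may not match the real file exactly.

```python
import pandas as pd

# Hypothetical sample of `time` values; the real column is assumed to hold "HH:MM" strings.
times = pd.Series(["08:00", "06:30", "17:45", "?", "08:15"])

# Values that do not parse as HH:MM (e.g. "?") become NaT and are dropped below.
parsed = pd.to_datetime(times, format="%H:%M", errors="coerce")
hours = parsed.dt.hour.dropna().astype(int)

# Number of recordings per hour of the day, sorted by hour.
recordings_per_hour = hours.value_counts().sort_index()
print(recordings_per_hour)
```

If the values were positions inside the audio, we would not expect them to cluster around typical field-recording hours.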

## Total number of species

There are 152 different bird species:

In [None]:
df_train['primary_label'].nunique()

In [None]:
df_train['primary_label'].unique()

## Audio samples per species

In [None]:
audio_samples_per_species = df_train.groupby("primary_label")['filename'].count().sort_values(ascending=False)
audio_samples_per_species.sort_values(ascending=True).plot.barh(figsize=(25, 40), alpha=0.5, title="Samples per species");

In [None]:
audio_samples_per_species[:50].plot.bar(figsize=(25, 5), rot=45, alpha=0.5, title="Samples per species. 50 species with the most samples")
plt.show()
audio_samples_per_species[50:100].plot.bar(figsize=(25, 5), rot=45, alpha=0.5, title="Samples per species. Species 50-100")
plt.show()
audio_samples_per_species[100:].plot.bar(figsize=(25, 5), rot=45, alpha=0.5, title="Samples per species. Species with the fewest samples");

In [None]:
# 20 species with the fewest samples
audio_samples_per_species[-20:]

# scored_birds.json

From the [Data tab](https://www.kaggle.com/c/birdclef-2022/data): **The subset of the species in the dataset that are scored.**

From the [Evaluation tab](https://www.kaggle.com/c/birdclef-2022/overview/evaluation): **Given the amount of audio data used in this competition it wasn't feasible to label every single species found in every soundscape. Instead only a subset of species are actually scored for any given audio file.**

It seems we will only need to detect the bird species listed in this file. There are 21 of them.

In [None]:
with open(f"{BASE_PATH}scored_birds.json") as fp:
    scored = json.load(fp)
    
print(f"Scored species: {scored}")
print(f"Total scored species: {len(scored)}")

In [None]:
scored_species_samples = df_train[df_train['primary_label'].isin(scored)]\
                            .groupby('primary_label').count()\
                            .rename(columns={'filename': 'total'})\
                            .sort_values('total', ascending=False)
scored_species_samples

In [None]:
scored_species_samples.plot.bar(figsize=(25, 5), rot=0, alpha=0.5, title="Samples per species for scored species");

There are various species for which we have very little data. That will definitely be a challenge!
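
To put a number on that imbalance, a rough sketch is shown below. The counts are invented for illustration (they are not the competition numbers), and the inverse-frequency weighting is just one common heuristic, the same formula scikit-learn uses for `class_weight="balanced"`, not something this notebook applies later.

```python
import pandas as pd

# Hypothetical per-species sample counts (not the real competition numbers).
counts = pd.Series({"species_a": 500, "species_b": 50, "species_c": 5})

# Ratio between the most and the least represented species.
imbalance_ratio = counts.max() / counts.min()

# Inverse-frequency weights: n_samples / (n_classes * count), so rarer species weigh more.
weights = counts.sum() / (len(counts) * counts)

print(f"Imbalance ratio: {imbalance_ratio:.0f}x")
print(weights)
```

With the real data, `counts` could be replaced by the `total` column of `scored_species_samples`.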

# train_audio path


In [None]:
TRAIN_AUDIO_PATH = f"{BASE_PATH}train_audio/"
train_subfolders = os.listdir(TRAIN_AUDIO_PATH)
print("Train subfolders: ", train_subfolders)
print("Total subfolders: ", len(train_subfolders))

In [None]:
# Folders are the same as df_train labels
set(train_subfolders) == set(df_train['primary_label'].tolist())

In [None]:
total_files = sum([len(files) for r, d, files in os.walk(TRAIN_AUDIO_PATH)])
total_files

In [None]:
# The total number of files is the same as the total number of rows in df train
total_files == len(df_train)
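
Matching totals do not strictly guarantee that every row points to an existing file. A stricter check could look like the sketch below, where `missing_files` is a hypothetical helper and the toy directory stands in for `train_audio/`.

```python
import tempfile
from pathlib import Path


def missing_files(audio_root, filenames):
    """Return the metadata filenames that do not exist under audio_root."""
    root = Path(audio_root)
    return [name for name in filenames if not (root / name).is_file()]


# Toy directory standing in for train_audio/ (the names below are made up).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "speciesa").mkdir()
    (Path(tmp) / "speciesa" / "rec1.ogg").touch()
    result = missing_files(tmp, ["speciesa/rec1.ogg", "speciesa/rec2.ogg"])

print(result)
```

On the real data the call would be `missing_files(TRAIN_AUDIO_PATH, df_train['filename'])`, and an empty list means the metadata and the folder agree.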

## Play some files

See [🐦Audio 101- 1) Audio manipulation & musical notes](https://www.kaggle.com/julian3833/audio-101-1-audio-manipulation-musical-notes)

In [None]:
def plot_waveform(waveform, sample_rate, color=None):
    """
    Plot a mono waveform with the x axis labeled in seconds.

    Arguments:
        waveform: 1D array of samples
        sample_rate: int, samples per second
        color: matplotlib color; a random Tableau color if None
    """
    waveform = np.asarray(waveform)  # also accepts 1D torch tensors
    if color is None:
        color = np.random.choice(list(matplotlib.colors.TABLEAU_COLORS.keys()))
    ax = pd.Series(waveform).plot(figsize=(20, 5), alpha=0.6, color=color)
    duration = len(waveform) / sample_rate

    # One tick per second, or one every 10 seconds for long recordings
    step = 10 if duration > 90 else 1
    ticks = list(range(0, sample_rate * int(duration + 1), sample_rate * step))
    labels = [step * i for i in range(len(ticks))]

    ax.set_xticks(ticks)
    ax.set_xticklabels(labels)
    ax.set_xlabel("Time (secs)")
    ax.set_ylabel("Amplitude")  # raw sample values, not decibels
    plt.show()

In [None]:
def load_file(filename):
    waveform, sample_rate = torchaudio.load(f"{TRAIN_AUDIO_PATH}{filename}")
    waveform = waveform.numpy()
    return waveform, sample_rate

def play_file(filename, max_duration=None, show_waveform=True):
    waveform, sample_rate = load_file(filename)
    
    waveform = waveform[0, : (max_duration * sample_rate) if max_duration is not None else None]
    
    display(Audio(waveform, rate=sample_rate))
    
    if show_waveform:
        plot_waveform(waveform, sample_rate)

In [None]:
a_file = df_train.iloc[0]['filename']
play_file(a_file)
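
The slicing inside `play_file` is a bit dense, so here is the same idea on a synthetic signal. This is only a sketch: `trim` is a hypothetical helper, and the 32 kHz sample rate is an assumption for the toy example.

```python
import numpy as np


def trim(waveform, sample_rate, max_duration=None):
    """Keep the first channel and at most max_duration seconds of a (channels, samples) array."""
    end = None if max_duration is None else int(max_duration * sample_rate)
    return waveform[0, :end]


sample_rate = 32000  # assumed value for this toy example
t = np.arange(3 * sample_rate) / sample_rate
stereo = np.stack([np.sin(2 * np.pi * 440 * t)] * 2)  # 3 seconds of a 440 Hz tone, 2 channels

clip = trim(stereo, sample_rate, max_duration=1)
full = trim(stereo, sample_rate)

print(stereo.shape, clip.shape, full.shape)
```

Slicing with `None` as the end index keeps the whole track, which is why `max_duration=None` plays the full recording.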

In [None]:
def play_species(species, df_train):
    filename = df_train[df_train['primary_label'] == species].sample(1)['filename'].iloc[0]
    display(HTML(f"<h2 style='color:green'>{species.capitalize()}</h2>{filename.split('/')[1]}"))
    play_file(filename, max_duration=10)

for species in scored[:2]:
    play_species(species, df_train)

# A more detailed overview of each species

From the [Data tab](https://www.kaggle.com/c/birdclef-2022/data):
**You can review detailed information about the bird codes by appending the code to https://ebird.org/species/, such as https://ebird.org/species/amecro for the American Crow.**
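
As a small illustration of that rule, the species page URL can be built with the standard library. `species_url` is a hypothetical helper name, not part of the competition material.

```python
from urllib.parse import urljoin

EBIRD_SPECIES_URL = "https://ebird.org/species/"


def species_url(code):
    """Build the eBird species page URL for a bird code."""
    # The base ends with "/", so urljoin appends the code as the last path segment.
    return urljoin(EBIRD_SPECIES_URL, code)


print(species_url("amecro"))
```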


In [None]:
DETAIL_BASE_URL = "https://ebird.org/species/"


def get_html(species):
    try:
        # DETAIL_BASE_URL already ends with "/", so no extra separator is needed
        html = requests.get(f"{DETAIL_BASE_URL}{species}", timeout=30).text
        return html
    except Exception as e:
        print(f"Exception trying to get html for {species}: {e}")
        return ""
    
    
def get_img_data(html):
    try:
        img_idx = html.find("https://cdn.download.ams.birds.cornell.edu/api/v1/")
        img_url = html[img_idx:img_idx+len("https://cdn.download.ams.birds.cornell.edu/api/v1/asset/") + 30]
        img_url = img_url[:img_url.find('"')]
        img = requests.get(img_url).content
        return img
    except Exception as e:
        print(f"Exception trying to get img: {e}")
        return ""
    
    
def get_and_display_image(html):
    display(Image(get_img_data(html), width=400, height=400))
    

def get_description(html):
    try:
        soup = BeautifulSoup(html, 'html.parser')
        description = soup.find_all('p', attrs={'class':'u-stack-sm'})[0].text.strip()    
    except Exception:  # e.g. no matching <p> element in the page
        description = ""
    return description

    
def display_sound_types(df_species):
    import ast  # local import: parses the stringified lists without eval

    df_species['type_list'] = df_species['type'].apply(ast.literal_eval)
    sounds = df_species.explode('type_list').groupby("type_list")['filename'].apply(list).to_dict()
    for sound_type, files_list in sounds.items():
        if sound_type in ['call', 'song']:
            # Play the first recording of each of these two sound types
            a_file = files_list[0]
            display(HTML(f"<h2> {sound_type}</h2>"))
            play_file(a_file, max_duration=None)
  
        
def show_species_details(species, df_train_full):

    df_species = df_train_full[df_train_full['primary_label'] == species].copy()
    name = df_species.iloc[0]['common_name']
    html = get_html(species)
    description = get_description(html)
    
    display(HTML(f"<h1 style='color:green'> {name} </h1>"))
    display(HTML(f"""<ul> <li>Label: <b>{species}</b></li> 
    <li> Scientific name: <b>{df_species.iloc[0]['scientific_name']}</b></li> 
    <li> Total training samples: <b>{len(df_species)}</b></li> 
    <li> Description: <b>{description}</b></li> 

    </ul>"""))
    
    get_and_display_image(html)
    display_sound_types(df_species)
        
df_train_full = pd.read_csv(f"{BASE_PATH}train_metadata.csv")
df_train_full.head()

In [None]:
#for species in scored:
#    show_species_details(species, df_train_full)
#    print("========================")


for species in ['akiapo', 'apapan', 'maupar', 'crehon']:
    show_species_details(species, df_train_full)
    print("========================")

There are a lot of different sounds and noises in there! In the last audio I think I can even hear flies, frogs, and various types of birds. To be honest, I cannot pick out a dominant song over all that noise... it seems like a very hard task!

# test_soundscapes/ 

From the [Data tab](https://www.kaggle.com/c/birdclef-2022/data):
**When you submit a notebook, the test_soundscapes directory will be populated with approximately 5,500 recordings to be used for scoring. These are each within a few milliseconds of 1 minute long and in the ogg audio format. Only one soundscape is available for download.**

The test soundscapes last about 1 minute each.
There will be about `5500` of them when submitting, while there is only one when saving the notebook. How much the prediction runtime grows depends on the prediction code, so it will not necessarily be a factor of `5500`.


In [None]:
!ls -l {BASE_PATH}test_soundscapes/

In [None]:
waveform, sample_rate = torchaudio.load(f"{BASE_PATH}test_soundscapes/soundscape_453028782.ogg")
waveform = waveform.numpy()  # torch tensor -> numpy array, as in load_file
plot_waveform(waveform[0], sample_rate)
Audio(waveform, rate=sample_rate)

I cannot detect a thing. The quality is very bad and the bird songs are very quiet compared to the handling noise of the recorder.


# test.csv

From the [Data tab](https://www.kaggle.com/c/birdclef-2022/data):

>Metadata for the test set. Only the first three rows are available for download; the full test.csv is provided in the hidden test set.
> * **row_id** - A unique identifier for the row.
> * **file_id** - A unique identifier for the audio file.
> * **bird** - The ebird code for the row. There is one row for each of the scored species per 5 second window per audio file.
> * **end_time** - The last second of the 5 second time window (5, 10, 15, etc).



In [None]:
df_test = pd.read_csv(f"{BASE_PATH}test.csv")
df_test

In [None]:
df_test.shape
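
Combining the figures quoted above gives a rough idea of how large the hidden `test.csv` should be. This is back-of-the-envelope arithmetic only, taking the approximate 5,500 files, 1-minute clips, 5-second windows and 21 scored species at face value.

```python
n_files = 5500        # approximate number of hidden soundscapes
clip_seconds = 60     # each soundscape lasts about 1 minute
window_seconds = 5    # one prediction per 5-second window
n_scored = 21         # scored species in scored_birds.json

windows_per_file = clip_seconds // window_seconds  # 12 windows per soundscape
rows_per_file = windows_per_file * n_scored        # 252 rows per soundscape
total_rows = n_files * rows_per_file               # about 1.4 million rows

print(windows_per_file, rows_per_file, total_rows)
```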

# sample_submission.csv

From the [Data tab](https://www.kaggle.com/c/birdclef-2022/data):

> sample_submission.csv - A valid sample submission. Only the first three rows are available for download; the full submission.csv is provided in the hidden test set.

> * **row_id** - A unique identifier for the row.
> * **target** - True/False for whether or not the bird in question called during the 5 second window.

In [None]:
df_sub = pd.read_csv(f"{BASE_PATH}sample_submission.csv")
df_sub

In [None]:
df_sub.shape

So, in the end, we need to have a function with the following signature:

$(\text{file} \times \text{scored bird} \times \text{5-second window}) \rightarrow \text{presence / absence}$

And the submission has the tuple $(\text{file} \times \text{scored bird} \times \text{5-second window})$ encoded as `row_id`.
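
A minimal sketch of that encoding is shown below. It assumes the `row_id` is simply `file_id`, `bird` and `end_time` joined with underscores, and that bird codes contain no underscores; both assumptions should be checked against `df_test` before relying on them. The helper names are hypothetical.

```python
def make_row_id(file_id, bird, end_time):
    """Encode (file, scored bird, window end) as a single string (assumed format)."""
    return f"{file_id}_{bird}_{end_time}"


def parse_row_id(row_id):
    """Invert make_row_id."""
    # Split from the right because file_id itself may contain underscores.
    file_id, bird, end_time = row_id.rsplit("_", 2)
    return file_id, bird, int(end_time)


row_id = make_row_id("soundscape_453028782", "akiapo", 5)
print(row_id)
print(parse_row_id(row_id))
```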

This is a very interesting problem. It is definitely not a plain single-label classification: several birds can be present in the same window, so it is better framed as a multilabel classification.

A few thoughts:
* The `secondary_labels` column might be very important in this scenario.
* Mixing several audio tracks together might help to build training examples in which more than one bird is present.
* This fact is bugging me: the training audios are not split into 5-second windows. The full track is labeled, so the presence or absence of the bird doesn't have a 5-second resolution. This is something to address, although I don't know how right now.
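
One naive way to approach the last point is to cut each training track into 5-second windows and let every window inherit the label of the whole track. The sketch below does this on random noise; `split_into_windows` is a hypothetical helper, the 32 kHz sample rate is an assumption, and the inherited labels would be weak, since the bird may be silent in some windows.

```python
import numpy as np


def split_into_windows(waveform, sample_rate, window_seconds=5):
    """Split a 1D waveform into fixed-length windows, zero-padding the last one."""
    window = int(window_seconds * sample_rate)
    n_windows = max(1, -(-len(waveform) // window))  # ceiling division
    padded = np.zeros(n_windows * window, dtype=waveform.dtype)
    padded[: len(waveform)] = waveform
    return padded.reshape(n_windows, window)


sample_rate = 32000  # assumed value for this toy example
waveform = np.random.default_rng(0).standard_normal(12 * sample_rate).astype(np.float32)

windows = split_into_windows(waveform, sample_rate)
print(windows.shape)  # 12 seconds -> 3 windows, the last one padded with 2 seconds of zeros
```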

In [None]:
df_train_full[df_train_full['secondary_labels'] != "[]"].head()

In [None]:
# 10% of the dataset has a non-empty "secondary label"
(df_train_full['secondary_labels'] != "[]").mean()
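
If we go the multilabel route, the primary and secondary labels could be merged into one multi-hot target per recording. The snippet below is a sketch under assumptions: the species list is a toy subset, `multi_hot` is a hypothetical helper, and `secondary_labels` is taken to be a stringified Python list, as the comparison with `"[]"` above suggests.

```python
import ast

import numpy as np

species = ["akiapo", "apapan", "crehon", "maupar"]  # toy label set
index = {code: i for i, code in enumerate(species)}


def multi_hot(primary_label, secondary_labels):
    """Build a 0/1 vector marking the primary species and any known secondary species."""
    target = np.zeros(len(species), dtype=np.float32)
    target[index[primary_label]] = 1.0
    for code in ast.literal_eval(secondary_labels):
        if code in index:  # ignore species outside the label set
            target[index[code]] = 1.0
    return target


example = multi_hot("apapan", "['akiapo', 'unknown']")
print(example)
```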

Since this notebook has Internet enabled, we will move to another one to play around with the submissions. We will start with https://www.kaggle.com/stefankahl/how-to-submit-to-birdclef-2022 and move on from there.
