In this notebook I'll explore the BirdCLEF 2022 dataset: I'll demonstrate how to visualize its audio data, analyze audio durations, inspect each training metadata column, and share any additional insights or tidbits I've discovered from playing around.

# Getting the data

In [None]:
import numpy as np
import pandas as pd
import torchaudio
import torch
from IPython.display import Audio
import matplotlib.pyplot as plt
from tqdm import tqdm

Let's first open the training metadata CSV to see what's available.

In [None]:
df = pd.read_csv('/kaggle/input/birdclef-2022/train_metadata.csv')
df.head()

One interesting challenge in this competition is that only 21 Hawaiian birds will be tested in the submission, even though the training data actually contains 152 birds in total (which we'll see when we analyze the label columns). 

So let's get the birds that will be used in the submission (these are called "scored" birds).

In [None]:
scored_birds = set(pd.read_json('/kaggle/input/birdclef-2022/scored_birds.json')[0])
scored_birds, len(scored_birds)

Another interesting challenge with this competition is that the training audio files vary in length. Some are very short, some very long. To understand this better, let's also grab the length of each audio file and add it as a column to our metadata. This might take a few minutes, since there are about 15,000 files!

Notice how I use `torchaudio.info` here instead of `torchaudio.load`. I'm just getting the length of the audio, so `.info` is faster.

In [None]:
lengths = []
for filename in tqdm(df['filename']):
    metadata = torchaudio.info('/kaggle/input/birdclef-2022/train_audio/' + filename)
    lengths.append(metadata.num_frames / metadata.sample_rate)
df['audio_length'] = lengths  # duration of each recording in seconds
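
To illustrate why a metadata-only read is enough for durations, here is a minimal sketch that uses Python's standard-library `wave` module as a stand-in for `torchaudio.info`. This is an assumption for illustration only: it builds a small in-memory WAV file rather than touching the competition audio, and the helper name `wav_duration_seconds` is hypothetical. The duration comes from the frame count and sample rate stored in the header, the same `num_frames / sample_rate` arithmetic as in the loop above.

```python
import io
import wave


def wav_duration_seconds(file_obj):
    """Return the duration from the WAV header (frame count / sample rate)."""
    with wave.open(file_obj, 'rb') as reader:
        return reader.getnframes() / reader.getframerate()


# Build a 2-second silent mono clip at 32 kHz entirely in memory.
buffer = io.BytesIO()
with wave.open(buffer, 'wb') as writer:
    writer.setnchannels(1)
    writer.setsampwidth(2)           # 16-bit samples
    writer.setframerate(32000)
    writer.writeframes(b'\x00\x00' * 64000)

buffer.seek(0)
duration = wav_duration_seconds(buffer)
print(duration)
```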

In total, the training data has 21,952,580,839 audio samples and is ~190.6 hours long.

In [None]:
total_length = df['audio_length'].sum()  # total duration in seconds
# seconds, samples (assuming every file is sampled at 32 kHz), hours
total_length, total_length * 32000, total_length / 60 / 60
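
As a quick sanity check on the two figures quoted above, the sample count and the number of hours should agree with each other if every file is at 32 kHz. That rate is the assumption already baked into the cell above; the helper name `samples_to_hours` is hypothetical.

```python
SAMPLE_RATE = 32_000  # assumed sample rate, matching the 32000 used above


def samples_to_hours(num_samples, sample_rate=SAMPLE_RATE):
    """Convert a sample count to hours: samples -> seconds -> hours."""
    seconds = num_samples / sample_rate
    return seconds / 60 / 60


reported_samples = 21_952_580_839  # figure quoted in the text
hours = samples_to_hours(reported_samples)
print(round(hours, 1))
```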

# Hear and visualize an example

Let's listen to a single audio example from the training set. Each recording is rated for quality, so let's just listen to one that has a 5 star rating.

In [None]:
datum = df[df['rating'] == 5].iloc[0] # get the first 5 star rated recording
audio, rate = torchaudio.load('/kaggle/input/birdclef-2022/train_audio/' + datum['filename'])
num_samples = audio[0].numel()
print(f'{num_samples} samples at {rate / 1000} kHz (~{round(num_samples / rate)} seconds)')
display(Audio(audio, rate=rate))
datum

Let's take a look at the first 5 seconds of audio from this sample to see what the first bird vocalization in that sample actually looks like. You can do this a number of ways, but the simplest is to average the audio's channels and then plot the samples over time.

Spectrograms are also a useful visualization technique. Similar to convolution, they slide a window over the audio and extract frequencies at each step. Taking the log of the spectrogram makes it a bit easier to see its structure.

In [None]:
averaged_audio_clip = audio.mean(0)[:rate*5]

plt.title('first 5 seconds of audio as pressure over time')
plt.plot(averaged_audio_clip)
plt.show()

plt.title('first 5 seconds of audio as a log magnitude spectrogram over time')
plt.imshow(torch.log10(torchaudio.transforms.Spectrogram(n_fft=512)(averaged_audio_clip)));
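
To make the "slide a window and extract frequencies" idea concrete, here is a rough pure-Python sketch of a magnitude spectrogram. It is a simplified illustration, not what `torchaudio.transforms.Spectrogram` does internally: it skips padding, uses a naive DFT instead of an FFT, and the function name `magnitude_spectrogram` and the small `n_fft`/`hop` values are my own choices.

```python
import cmath
import math


def magnitude_spectrogram(signal, n_fft=64, hop=32):
    """Slide a Hann window over `signal`; return one magnitude spectrum per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(n_fft)]
        spectrum = []
        for k in range(n_fft // 2 + 1):  # keep non-negative frequencies only
            total = sum(
                chunk[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(n_fft)
            )
            spectrum.append(abs(total))
        frames.append(spectrum)
    return frames


# A pure tone with 8 cycles per window should peak in frequency bin 8.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = magnitude_spectrogram(tone)
peak_bins = [frame.index(max(frame)) for frame in spec]
print(len(spec), len(spec[0]), peak_bins)
```

Each inner list is one column of the image plotted above: time runs across frames, frequency runs across bins.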

# Columns

Now let's go back to the training metadata CSV, and investigate each column.

In [None]:
for key in df:
    print(key)
    display(df[key].describe())
    print()

## Labels

Here's the competition's description of the "primary_label" and "secondary_labels" columns:

> primary_label: a code for the bird species. You can review detailed information about the bird codes by appending the code to https://ebird.org/species/, such as https://ebird.org/species/amecro for the American Crow.

> secondary_labels: Background species as annotated by the recordist. An empty list does not mean that no background birds are audible.

Note that the distribution of individual secondary labels seems to be a bit different from that of primary labels. Also, it looks like there are some primary labels that never appear as secondary labels, but every secondary label also appears as a primary label.

In [None]:
df.primary_label.value_counts()[:30].plot.bar(width=0.9, title="Top 30 primary labels", ylabel="occurrences")
df.primary_label.value_counts()

In [None]:
secondary_labels_flattened = pd.Series(np.concatenate(df.secondary_labels.apply(lambda v: v[2:-2].split("', '"))))
secondary_labels_flattened = secondary_labels_flattened[secondary_labels_flattened != '']
secondary_labels_flattened.value_counts()[:30].plot.bar(width=0.9, title="Top 30 secondary labels", ylabel="occurrences")
secondary_labels_flattened.value_counts()
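
The string slicing above works because the column stores Python-style list literals as text. An alternative sketch, assuming that format holds for every row, is to parse each value with `ast.literal_eval` and then compare the label sets directly. The rows below are hypothetical, not taken from the dataset, and `parse_label_list` is a made-up helper name.

```python
import ast


def parse_label_list(raw):
    """Parse a string like "['a', 'b']" into a Python list of labels."""
    return list(ast.literal_eval(raw))


# Hypothetical rows for illustration only
primary = ['skylar', 'houfin', 'comsan']
secondary_raw = ["[]", "['skylar']", "['houfin', 'skylar']"]

secondary = [label for raw in secondary_raw for label in parse_label_list(raw)]

# Primary labels that never show up as a secondary label
never_secondary = set(primary) - set(secondary)
# True when every secondary label also occurs as a primary label
all_secondary_are_primary = set(secondary) <= set(primary)
print(never_secondary, all_secondary_are_primary)
```

An empty list parses to `[]`, so there is no empty-string entry to filter out afterwards.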

In [None]:
# pd.concat replaces Series.append, which was removed in pandas 2.0
all_labels = pd.concat([df.primary_label, secondary_labels_flattened])
all_labels.value_counts()[:30].plot.bar(width=0.9, title="Top 30 primary+secondary labels", ylabel="occurrences")
all_labels.value_counts()

## Filter labels by scored birds

Only about 10% of training samples have a scored bird as the primary label or among the secondary labels. So one interesting aspect of this competition is how to best use data that has labels that won't be scored.

Looking at their distribution, we can see another challenge with this competition: class imbalance. The most common scored bird "skylar" is in 500+ recordings, but there's a long tail; the least common bird "maupar" is only in 1 recording. This is a huge challenge, because the competition's evaluation uses a macro F1 score, which considers the performance of each class equally important.

In [None]:
primary_labels_filter = df.primary_label.isin(scored_birds)
secondary_labels = df.secondary_labels.apply(lambda v: v[2:-2].split("', '"))
secondary_label_filter = np.array([len(scored_birds.intersection(x)) > 0 for x in secondary_labels])

pct_scored_bird = 100 * len(df[primary_labels_filter | secondary_label_filter]) / len(df)
print(f'% of samples with a scored bird in the primary or secondary labels: {pct_scored_bird}%')

In [None]:
df.primary_label[df.primary_label.isin(scored_birds)].value_counts().plot.bar(
    width=0.9, title="Primary labels - scored", ylabel="occurrences"
)
df.primary_label[df.primary_label.isin(scored_birds)].value_counts()

In [None]:
all_labels[all_labels.isin(scored_birds)].value_counts().plot.bar(
    width=0.9, title="Primary + secondary birds - scored", ylabel="occurrences"
)
all_labels[all_labels.isin(scored_birds)].value_counts()
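
To see why the macro F1 score mentioned above makes rare birds matter so much, here is a small sketch with made-up predictions. It is a simplified single-label version for illustration (the competition's exact metric setup may differ), and the helper names are hypothetical. A model that always predicts the common bird gets 90% accuracy but a macro F1 under 0.5, because the rare class scores zero and counts just as much.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class: 2*TP / (2*TP + FP + FN), or 0.0 if undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Made-up example: 9 recordings of a common bird, 1 of a rare bird
y_true = ['skylar'] * 9 + ['maupar']
y_pred = ['skylar'] * 10  # the model never predicts the rare bird

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(accuracy, round(score, 3))
```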

## Audio lengths

Instead of ranking birds just by the # of recordings, let's rank by the length of audio.

In [None]:
# Sum seconds per label, then convert to hours
(df.groupby('primary_label').audio_length.sum() / 60 / 60).sort_values(ascending=False)[:30].plot.bar(
    width=0.9, title="Top 30 primary labels by audio length", ylabel="hours"
)

In [None]:
# Same ranking, restricted to scored birds (seconds converted to hours)
(df[df.primary_label.isin(scored_birds)].groupby('primary_label').audio_length.sum() / 60 / 60).sort_values(ascending=False).plot.bar(
    width=0.9, title="Primary labels by audio length - scored", ylabel="hours"
)
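
For anyone who wants to double check the units in these charts, here is the same group-sum-convert logic as a plain-Python sketch. The rows are made up and `hours_per_label` is a hypothetical helper; the only point is that dividing seconds by 60 twice gives hours.

```python
from collections import defaultdict


def hours_per_label(rows):
    """Sum durations (in seconds) per label and convert the totals to hours."""
    totals = defaultdict(float)
    for label, seconds in rows:
        totals[label] += seconds
    return {label: secs / 60 / 60 for label, secs in totals.items()}


# Made-up (label, seconds) rows for illustration
rows = [('skylar', 1800), ('skylar', 5400), ('maupar', 900)]
result = hours_per_label(rows)
ranked = sorted(result, key=result.get, reverse=True)
print(result, ranked)
```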

We can also bin each recording by its length, to get an idea of the distribution of durations. This shows that ~80% of recordings are under a minute, but there are some outliers: in the most extreme case, one recording is 75 minutes long!

In [None]:
plt.yscale('log')
plt.ylabel('# of samples')
plt.xlabel('length of sample (minutes)')
plt.title('Distribution of audio lengths')
plt.hist([l/60 for l in df.audio_length], bins=np.arange(0, 81, 0.5));

In [None]:
plt.yscale('log')
plt.ylabel('# of samples')
plt.xlabel('length of sample (minutes)')
plt.title('Distribution of audio lengths for scored birds')
plt.hist([l/60 for l in df[df.primary_label.isin(scored_birds)].audio_length], bins=np.arange(0, 81, 0.5));

In [None]:
df.audio_length.describe()

In [None]:
(df[df.primary_label.isin(scored_birds)].audio_length/60).describe()

In [None]:
print(f'% of audio < 1 minute long: {100 * sum(df.audio_length < 60) / len(df)}%')
print('hours of audio in samples with scored primary labels:', sum(df[df.primary_label.isin(scored_birds)].audio_length) / 60 / 60)

## Calls

The "type" column appears to be a list of tag-like descriptions of the recording. The top two are "call" and "song".

By the way, apparently there's a big difference between a "birdcall" and a "birdsong", as explained by the [Wikipedia page on bird vocalization](https://en.wikipedia.org/wiki/Bird_vocalization):

> In ornithology and birding, songs (relatively complex vocalizations) are distinguished by function from calls (relatively simple vocalizations).

In [None]:
types = pd.Series(np.concatenate(df.type.apply(lambda v: v[2:-2].split("', '"))))
types.value_counts()[:30].plot.bar(width=0.9)
types.value_counts()

## Author

According to the competition description, "Author" is

> the eBird user who provided the recording

Paul Marvin is the top contributor by far. What a hero! From googling, I found a note about him [on Cornell Lab's Facebook page](https://www.facebook.com/macaulaylibrary/posts/10156633504705424):

> Paul lives in Cocoa, Florida, where he has taken advantage of the warm weather and abundant birdlife to make numerous recordings this winter in conjunction with complete eBird checklists. When he is not out in the field making new recordings, he has also been creating historical checklists to upload some of the many recordings he has made since he started recording in earnest in 2011.

In [None]:
df.author.value_counts()[:30].plot.bar(width=0.9, title="Author")
df.author.value_counts()

## Rating

According to the competition description, "Rating" is:

> Float value between 0.0 and 5.0 as an indicator of the quality rating on Xeno-canto and the number of background species, where 5.0 is the highest and 1.0 is the lowest. 0.0 means that this recording has no user rating yet.

Fortunately, it looks like most of the recordings have a high rating.

In [None]:
df.rating.plot.hist(bins=10, title="Rating")
df.rating.value_counts()

## Time

I'm guessing the "Time" column is the time of day when the recording took place. This column is pretty messy, so I wrote some code to clean it up and convert it to a proper time of day. You can see an interesting peak at 8 AM. It turns out there's a name for that peak: the [dawn chorus](https://en.wikipedia.org/wiki/Dawn_chorus_(birds))!

In [None]:
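def clean_time(t):
    """Normalize a raw time string such as '8:30am' to 'HH:MM:SS' where possible."""
    is_am, is_pm = t.endswith('am'), t.endswith('pm')
    if is_am or is_pm:
        t = t[:-2]
    parts = t.split(':')
    if parts[0].isnumeric() and (is_am or is_pm):
        # 12-hour to 24-hour: 12am becomes hour 0, 12pm stays 12
        parts[0] = str(int(parts[0]) % 12 + (12 if is_pm else 0))
    t = ':'.join(parts)
    if len(parts) == 2:
        t = t + ':00'
    return t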

coerce_times = pd.to_timedelta(df.time.apply(clean_time), errors="coerce")
# Whole hours as floats; astype('timedelta64[h]') is not supported in pandas 2.x
(coerce_times.dropna().dt.total_seconds() // 3600).plot.hist(bins=24, title="Time of day (hour)")
plt.ylabel("# of recordings")
plt.grid()

Here are all the times that I wasn't able to parse. You can see that some of them are just question marks.

In [None]:
# Cases that couldn't be converted
df.time[coerce_times.isnull()].value_counts()
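
One detail worth checking when cleaning 12-hour times is the pair of edge cases around noon and midnight: "12pm" should stay hour 12, and "12am" should become hour 0. Here is a small standalone sketch of that conversion. The function `hour_of_day` is a hypothetical helper, not the cleaning code used above, and the example strings are made up rather than pulled from the dataset.

```python
def hour_of_day(raw):
    """Return the hour (0-23) for strings like '8:30am' or '17:05', else None."""
    text = raw.strip().lower()
    suffix = text[-2:] if text.endswith(('am', 'pm')) else ''
    if suffix:
        text = text[:-2].strip()
    hour_text = text.split(':')[0]
    if not hour_text.isdigit():
        return None  # e.g. '?' or other unparseable values
    hour = int(hour_text)
    if suffix:
        if not 1 <= hour <= 12:
            return None
        # 12am -> 0, 1pm -> 13, 12pm -> 12
        hour = hour % 12 + (12 if suffix == 'pm' else 0)
    return hour if 0 <= hour <= 23 else None


examples = ['8:30am', '12:15pm', '12:05am', '17:05', '?']
hours = [hour_of_day(e) for e in examples]
print(hours)
```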