<img src = "https://images.unsplash.com/photo-1542382156909-9ae37b3f56fd?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=1843&q=80" alt = "spectrogram" width = "800" height = "200"/>
<i>Photo by <a href="https://unsplash.com/@mehdisepehri?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Mehdi Sepehri</a> on <a href="https://unsplash.com/s/photos/bird?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a></i>
  

# BirdCLEF2022 - EDA
Task: given a recording, identify whether a specific bird is calling.  
For testing, the recordings are chopped into 5 s segments. For each segment, we have to identify whether a target bird is calling.
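
As a rough illustration of the test-time setup, the sketch below cuts a waveform into consecutive 5 s segments with NumPy. The 32 kHz sampling rate, the zero-padding of the last segment, and the helper name `split_into_segments` are assumptions made for this example; they are not taken from the competition code.

```python
import numpy as np

def split_into_segments(signal, sr, seconds = 5):
    """Cut a 1-D waveform into consecutive segments of `seconds` each.

    The last segment is zero-padded so that all segments have the same length.
    """
    seg_len = int(seconds * sr)
    n_segments = int(np.ceil(len(signal) / seg_len))
    padded = np.zeros(n_segments * seg_len, dtype = signal.dtype)
    padded[:len(signal)] = signal
    return padded.reshape(n_segments, seg_len)

sr = 32000  # assumed sampling rate
signal = np.random.default_rng(0).normal(size = 12 * sr).astype(np.float32)  # 12 s of noise
segments = split_into_segments(signal, sr)
print(segments.shape)  # 12 s -> 3 segments of 5 s, the last one padded
```

A model would then be asked for one prediction per row of `segments`.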

<h3><a href = "#part-0">Part 0 - Basic settings</a></h3>
<p style = "margin: 0;">&emsp;&emsp;<a href = "#libraries">Libraries</a></p>
<p style = "margin: 0;">&emsp;&emsp;<a href = "#functions">Functions</a></p>
<p style = "margin: 0;">&emsp;&emsp;<a href = "#scored-birds">Scored Birds</a></p>

<h3><a href = "#part-1">Part 1 - EDA</a></h3>
<p style = "margin: 0;">&emsp;&emsp;<a href = "#recording-length">1.1 Recording Length</a></p>
<p style = "margin: 0;">&emsp;&emsp;<a href = "#secondary-labels">1.2 Secondary Labels</a></p>
<p style = "margin: 0;">&emsp;&emsp;<a href = "#location">1.3 Location</a></p>
<p style = "margin: 0;">&emsp;&emsp;<a href = "#rating">1.4 Rating</a></p>
<p style = "margin: 0;">&emsp;&emsp;<a href = "#species">1.5 Information About the Species</a></p>

<h3><a href = "#part-2">Part 2 - Audio Features</a></h3>
<p style = "margin: 0;">&emsp;&emsp;<a href = "#spectrogram">2.1 Waveform, Spectrogram</a></p>
<p style = "margin: 0;">&emsp;&emsp;<a href = "#melspectrogram">2.2 Mel spectrogram, MFCCs</a></p>

<h3><a href = "#part-3">Part 3 - Outliers</a></h3>
<p style = "margin: 0;">&emsp;&emsp;<a href = "#maupar"> 3.1 Maupar</a></p>

<a id="part-0"></a>
# Part 0 - Basic settings

We do a couple of things in the basic settings:
1. import libraries
2. define util functions
3. get the scored birds

<a id="libraries"></a>
## Libraries

In [None]:
import os
import re
import librosa
import librosa.display

import pandas as pd
import numpy as np
import seaborn as sns
import soundfile as sf
import matplotlib.pyplot as plt
import IPython.display as ipd

from tqdm import tqdm


sns.set_theme(style = "darkgrid")
data_src = "../input/birdclef-2022"

In [None]:
train_meta = pd.read_csv("../input/birdclef-2022/train_metadata.csv")
train_meta.head()

<a id="functions"></a>
## Functions
Feel free to skip them until they are used.

In [None]:
"""section 1.1"""
# get the sound length and sampling rate
def get_len(row):
    fdir = f"{data_src}/train_audio/{row['filename']}"
    data, sr = sf.read(fdir)
    return len(data) / sr, sr


"""section 2.1"""
# feature parameters
win_length = 1024
hop_length = 512
n_fft = 1024

# use * to indicate the scored species
def is_scored(bird_name):
    # if the scored list has not been loaded yet, return the name unchanged
    try: scored 
    except NameError: return bird_name
    return bird_name + "*" if bird_name in scored else bird_name


def insepct_recording(fname):
    
    # check if you pass a filename or a Series
    if type(fname) == str:
        fdir = data_src + "/train_audio/" + fname
        sample = train_meta[train_meta.filename == fname].iloc[0]
    else:
        fdir = data_src + "/train_audio/" + fname["filename"]
        sample = fname
        
    # get spectrogram
    signal, sr = librosa.load(fdir, sr = None)
    spec = librosa.stft(signal, win_length = win_length, hop_length = hop_length, n_fft = n_fft)
    spec_mag = abs(spec)
    spec_mag = librosa.amplitude_to_db(spec_mag)
    
    # set the title; display the secondary label.
    title = sample["primary_label"]
    title = is_scored(title)
    sec_labels = re.findall(r"'(\w+)'", sample["secondary_labels"])
    if sec_labels:
        n = len(sec_labels)
        sec_labels = ", ".join([is_scored(label) for label in sec_labels])
        title = title + f"\n{'secondary label' if n == 1 else 'secondary labels'}: {sec_labels}" 
    
    # plot
    plt.figure(figsize = (30,5))
    
    plt.subplot(1,2,1)
    librosa.display.waveshow(signal, sr = sr)
    plt.title(title)
    
    plt.subplot(1,2,2)
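    # the STFT has linearly spaced frequency bins; pass sr and hop_length so both axes are labelled correctly
    librosa.display.specshow(spec_mag, sr = sr, hop_length = hop_length, x_axis = "time", y_axis = "linear", cmap = "rocket")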
    plt.show()
    
    ipd.display(ipd.Audio(fdir))
    
    return signal, spec_mag
    
    
"""section 3.2"""

# plot spectrogram-like features
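def plot_spec(spec, title, ax):
    librosa.display.specshow(spec, x_axis = "time", ax = ax, cmap = 'rocket')
    # show the title on the first row only and the time label on the last row only
    subplot_spec = ax.get_subplotspec()
    if subplot_spec.is_first_row(): ax.set_title(title)
    if not subplot_spec.is_last_row(): ax.set_xlabel("")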

        
def plot_features(filename, sr = None, resolutions = None):
    # check if you pass a filename or a Series 
    if type(filename) == str:
        sample = train_meta[train_meta.filename == filename].iloc[0]
    else:
        sample = filename
        
    # get the waveform; sr = None keeps the native sampling rate of the file
    fdir = data_src + "/train_audio" + "/" + sample["filename"]
    label = sample["primary_label"]
    signal, sr = librosa.load(fdir, sr = sr)

    # default resolutions, derived from the sampling rate of the loaded signal:
    # a fixed 2048-sample window, and a 25 ms window with a 10 ms hop
    if resolutions is None:
        resolutions = [
            {"win_length": 2048, "hop_length": 512},
            {"win_length":int(25e-3 * sr), "hop_length":int(10e-3 * sr)},
        ]

    # create subplots
    fig, axes = plt.subplots(
        figsize = (30, 5 * len(resolutions)),
        ncols=4,
        nrows = len(resolutions),
        gridspec_kw={'width_ratios': [1,5,5,5]},
        squeeze = False  # keep axes 2-D even when there is a single resolution
    )

    
    # loop through the resolutions
    for i, resolution in enumerate(resolutions):
        # stft
        spec = librosa.stft(y = signal, **resolution)
        spec = abs(spec)
        spec = librosa.amplitude_to_db(spec)

        # mel spectrogram (a power spectrogram by default, hence power_to_db)
        melspec = librosa.feature.melspectrogram(y = signal, sr = sr, **resolution)
        melspec = librosa.power_to_db(melspec)

        # mfccs
        mfcc = librosa.feature.mfcc(y = signal, sr = sr, n_mfcc=13, **resolution)
        delta1 = librosa.feature.delta(mfcc, order = 1)
        delta2 = librosa.feature.delta(mfcc, order = 2)
        mfccs = np.concatenate([mfcc, delta1, delta2], axis = 0)

        # plot
        plt.suptitle("species: " + label)
        ax_text = axes[i, 0]
        ax_text.text(s = f"win length: {resolution['win_length']}\nhop length: {resolution['hop_length']}", x = 0.2, y = 0.5)
        ax_text.set_xticklabels([])
        ax_text.set_yticklabels([])
        ax_text.grid(False)
        ax_text.axis('off')

        plot_spec(spec, "spectrogram", axes[i, 1])
        plot_spec(melspec, "mel spectrogram", axes[i, 2])
        plot_spec(mfccs, "MFCCs", axes[i, 3])

        plt.subplots_adjust(wspace=0.1)

    plt.show()
    ipd.display(ipd.Audio(fdir))
    
    return signal, spec, melspec, mfccs
    

<a id="scored-birds"></a>
## Scored Birds

These birds are the target of this competition.  
They are endangered, and we are going to save them.  
[More information](https://www.kaggle.com/amandanavine/hawaiian-bird-species)

In [None]:
# scored birds. these are what we need to recognize
import json

with open(f"{data_src}/scored_birds.json") as f: 
    scored = json.load(f)

train_meta["scored"] = train_meta.primary_label.isin(scored)

In [None]:
train_meta[train_meta.scored == True].primary_label.unique()

<a id="part-1"></a>
# Part 1 - EDA

<a id="recording-length"></a>
## 1.1 Recording Length

In [None]:
"""
this cell takes a very long time to run when the dataframe is built from scratch.
to avoid that, I have prepared a precomputed version of the metadata.
choose which version to run by commenting / uncommenting the code below.
"""

## build the df from scratch 
# tqdm.pandas()

# length = train_meta.progress_apply(lambda row: get_len(row), axis = 1)
# train_meta["time_in_sec"] = length.apply(lambda row: row[0])
# train_meta["time_in_hour"] = train_meta["time_in_sec"] / 3600
# train_meta["sr"] = length.apply(lambda row: row[1])

## load the completed version
fpath = "../input/birdclef2022train-meta/train_meta.csv"
train_meta = pd.read_csv(fpath)

In [None]:
print(f"total hours: {train_meta.time_in_hour.sum() :.2f}")
train_meta.time_in_sec.describe()

In [None]:
plt.figure(figsize = (20,5))
plt.subplot(1,2,1)
sns.histplot(data = train_meta, x = "time_in_sec", hue = "scored")
plt.title("distribution of recording time")

plt.subplot(1,2,2)
sns.histplot(data = train_meta, x = "time_in_sec", hue = "scored")
plt.title("distribution of recording time (zoom in)")
plt.xlim([0,200])
plt.show()

The distribution is heavily right-skewed. Most of the recordings are 10s - 48s long.  
We also have some outliers.  
As mentioned [here](https://www.kaggle.com/c/birdclef-2022/discussion/312355), two files are corrupted: `blkfra/XC649198.ogg` and `normoc/XC150238.ogg`, which happen to be the shortest ones.  
We should remove them. 

In [None]:
corrupted = train_meta[train_meta.time_in_sec < 0.2]
train_meta = train_meta[train_meta.time_in_sec > 0.2]

Then, we listen to the outliers.  
Note that we can't play the longest one here, but we can follow the link and listen to it there.

In [None]:
q_1 = train_meta.time_in_sec.quantile(0.1)
q_9 = train_meta.time_in_sec.quantile(0.9)
maxlen = train_meta.time_in_sec.max()

# random sample below 0.1 quantile
really_short = train_meta[train_meta.time_in_sec < q_1].sample(frac = 1, random_state = 1).iloc[0]

# random sample above the 0.9 quantile (excluding the longest recording)
really_long = train_meta[(train_meta.time_in_sec > q_9) & (train_meta.time_in_sec != maxlen)].sample(frac = 1, random_state = 1).iloc[0]
get_fdir = lambda fname: data_src + "/train_audio/" + fname

# the longest
the_longest = train_meta[train_meta.time_in_sec == train_meta.time_in_sec.max()].iloc[0]

print(f"really short recording {really_short.time_in_sec}s")
ipd.display(ipd.Audio(get_fdir(really_short.filename)))

print("---")
print(f"really long recording {really_long.time_in_sec}s")
ipd.display(ipd.Audio(get_fdir(really_long.filename)))

print("---")
print(f"the longest recording {the_longest.time_in_sec}s")
print(the_longest.url)

print("---")
print(f"recordings shorter than 0.5s: {(train_meta.time_in_sec < 0.5).sum()}")

Next, let's look at the total recording time of each species.

In [None]:
xlabel = "time"
ylabel = "species"
time_counts = (train_meta
                # select the numeric columns before summing, so text columns are not aggregated
                .groupby("primary_label")[["time_in_sec", "scored"]]
                .sum()
                .reset_index()
                .rename({"primary_label": ylabel, "time_in_sec": xlabel}, axis = 1)
                .sort_values(by = xlabel, ascending = False)
               )
time_counts["scored"] = time_counts["scored"] >= 1

plt.figure(figsize = (10,30))
sns.barplot(data = time_counts, x = xlabel, y = ylabel, hue = "scored", dodge = False)
plt.title("total recording time of species")
plt.show()

<a id="secondary-labels"></a>
## 1.2 Secondary Labels

Why is this important?  
If a recording has secondary labels, it means multiple birds are singing.  
If our model hears multiple birds singing, it might get confused.  
To illustrate what I mean, let's listen to a file.

In [None]:
import IPython.display as ipd

sample = train_meta.iloc[1]
fname = sample.filename
fdir = f"{data_src}/train_audio/{fname}"
print(f"primary label: {sample['primary_label']}")
print(f"second labels: {sample['secondary_labels']}")
ipd.Audio(fdir)

I can recognize two different bird sounds at the beginning.  
And this column tells us there are 3 birds singing in the background.  
This is a source of noise for our model.  
Next, let's take a look at how many recordings have secondary labels.

In [None]:
import re

has_secondary_label = (train_meta
                       .set_index("primary_label")
                       .apply(lambda row: True if re.search(r"'(\w+)'", row["secondary_labels"]) else False, axis = 1)
                       .groupby("primary_label")
                       .sum()
                       .sort_values(ascending = False)
                       .reset_index()   
                       .rename({0: "count"}, axis = 1)
                      )

has_secondary_label["scored"] = has_secondary_label["primary_label"].isin(scored)
has_secondary_label = has_secondary_label[has_secondary_label["count"] > 0]
num_recordings = train_meta.groupby("primary_label").count()["secondary_labels"].reset_index().rename({"secondary_labels": "total_count"}, axis = 1)
has_secondary_label = pd.merge(has_secondary_label, num_recordings, on = "primary_label")

In [None]:
plt.figure(figsize = (10,30))
sns.barplot(data = has_secondary_label, y = "primary_label", x = "count", hue = "scored", dodge=False)
sns.barplot(data = has_secondary_label, y = "primary_label", x = "total_count", color = "b", alpha = 0.2)
plt.title("how many recordings have the secondary labels?")
plt.show()

Things are fine for most of the birds: only a small portion of their recordings contain background species, just as expected.  
But things are bad for birds like `hawcre`.  
For this bird, all the recordings have secondary labels, which means our model won't be able to hear this bird on its own, unless the main bird and the background birds are somehow separable. 
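
To put a number on this, one option is to compute the share of recordings with secondary labels for each species. The sketch below does so on a small made-up table; the species codes are invented, and it assumes that `secondary_labels` stores Python-style list literals as strings (for example `"['aaa', 'bbb']"`), which is why `ast.literal_eval` is used for parsing.

```python
import ast
import pandas as pd

# toy metadata; the species codes and label lists are made up
toy_meta = pd.DataFrame({
    "primary_label": ["aaa", "aaa", "aaa", "bbb", "bbb"],
    "secondary_labels": ["[]", "['bbb']", "[]", "['aaa', 'ccc']", "['ccc']"],
})

# the column holds list literals as strings, so parse them before counting
n_secondary = toy_meta["secondary_labels"].apply(lambda s: len(ast.literal_eval(s)))
toy_meta["has_secondary"] = n_secondary > 0

# share of recordings per species that contain at least one background species
share = toy_meta.groupby("primary_label")["has_secondary"].mean()
print(share)
```

A share of 1.0, as for `bbb` in this toy table, corresponds to the `hawcre` situation described above.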

<a id="location"></a>
## 1.3 Location

In [None]:
import geopandas as gpd
from shapely.geometry import Point, Polygon
from matplotlib.lines import Line2D

# the map is a matplotlib figure
# in world_map, each region is a polygon. 
# the coordinates of those polygons are stored inside the file, and geopandas converts them into the matplotlib coordinates
world_map = gpd.read_file("../input/world-countries-shape-2017-pop-and-gdp/world-countries.shp")

# same thing, longitude, latitude --> matplotlib coordinate.
points = [Point(x,y) for x,y in zip(train_meta["longitude"], train_meta["latitude"])]
geo_df = gpd.GeoDataFrame(train_meta, crs="EPSG:4326", geometry=points)

# grab the species count
species_id = geo_df["primary_label"].value_counts().reset_index()
species_id.insert(0, 'ID', range(0, 0 + len(species_id)))
species_id.columns = ["ID", "primary_label", "count"]
geo_df = pd.merge(geo_df, species_id, how="left", on="primary_label")

# plot
fig, ax = plt.subplots(figsize = (20,10))
world_map.plot(ax = ax, alpha = 0.3)

print("plotting...")
for i in tqdm(range(geo_df.primary_label.nunique())):
    species = geo_df[geo_df["ID"] == i]
    if species.iloc[0]["scored"] == True:
        species.plot(ax=ax, markersize=10, marker="o", color = "r")
    else:
        species.plot(ax=ax, markersize=10, marker="o", color = "g", alpha = 0.3)
        
legend_elements = [ 
    Line2D([0], [0], color = "w", marker = 'o', markerfacecolor='r',label='scored'),
    Line2D([0], [0], color = "w", marker = 'o', markerfacecolor='g', label='not scored'),
]

# show Hawaii
p = Point(-155, 20).buffer(10)
x,y = p.exterior.xy
plt.plot(x,y)
plt.text(-180,20,"Hawaii")

plt.legend(handles = legend_elements, facecolor = "w")
plt.show()

<a id="rating"></a>
## 1.4 Rating

In [None]:
# name the index and the counts explicitly, so the result does not depend on the pandas version
ratings = train_meta["rating"].value_counts().rename_axis("rating").reset_index(name = "count")
plt.figure(figsize = (10,5))
sns.barplot(data = ratings, x = "rating", y = "count", color = "b")
plt.title("rating statistics")
plt.show()

Just curious: what does a recording with rating 0.0 sound like?

In [None]:
rating_zero = train_meta[train_meta["rating"] == 0].sample(frac = 1, random_state = 1).iloc[1]
fdir = data_src + "/train_audio/" + rating_zero["filename"]
ipd.display(ipd.Audio(fdir))

signal, sr = librosa.load(fdir, sr = None)
plt.specgram(signal, Fs = sr, NFFT = 1024, scale = "dB")
plt.grid(False)
plt.show()

print(f"rating zeros: {train_meta[train_meta['rating'] == 0].shape[0]}")

Well, it is really noisy, and we have 570 of them.  
To some extent, `rating` measures the quality of our data, just like `secondary_labels`.  
Here we care about the species, so let's look at the average rating of each species.

In [None]:
plt.figure(figsize = (10,30))
ratings = train_meta.groupby("primary_label")["rating"].mean().reset_index().sort_values(by = "rating", ascending = False)
ratings["scored"] = ratings["primary_label"].isin(scored)
sns.barplot(data = ratings, y = "primary_label", x = "rating", dodge = False, hue = "scored")
plt.title("what's the average rating of each species?")
plt.show()

<a id="species"></a>
## 1.5 Information About the Species

In [None]:
# print a general information
scored_time_counts = time_counts[time_counts["scored"]]
scored_ratings = ratings[ratings["scored"]]

information =  f"""
total species:                         {time_counts.shape[0]}
scored species:                        {scored_time_counts.shape[0]}
average time per species:              {time_counts.time.mean() :.2f}s
average time per scored species:       {scored_time_counts.time.mean() :.2f}s
average ratings per species:           {ratings.rating.mean() :.2f}
average ratings per scored species:    {scored_ratings.rating.mean() :.2f}

min time per species:                  {time_counts.time.min() :.2f}s
max time per species:                  {time_counts.time.max() :.2f}s
"""


print(information)

<a id="part-2"></a>

# Part 2 - Audio Features

In [None]:
# sampling rate
train_meta.sr.value_counts()

<a id="spectrogram"></a>

## 2.1 Waveform, Spectrogram
The waveform is useful for locating silence. It is also useful if you want to use models like [sincnet](https://arxiv.org/pdf/1808.00158.pdf) that work on raw audio.  
The spectrogram shows how the frequency content changes over time. It is the most common way to get a feel for the audio.  
By combining the waveform and the spectrogram, you should get an intuition for how birds sing.  
Here I display 5 samples. You can show more by changing `nsamples`.  
*Note: a star `*` after the title means this species is scored. If you didn't run the cell that creates the scored list, no star is shown.*
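
As an aside on locating silence from the waveform, a common approach is to threshold the frame-wise RMS energy. The sketch below is one minimal way to do this with NumPy on a synthetic signal; the sampling rate, the frame sizes, the threshold, and the helper `frame_rms` are all assumptions made for illustration, and a real recording would need a threshold tuned to its noise floor.

```python
import numpy as np

def frame_rms(signal, frame_length = 1024, hop_length = 512):
    """Root-mean-square energy of each frame (a trailing partial frame is dropped)."""
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    starts = np.arange(n_frames) * hop_length
    frames = np.stack([signal[s:s + frame_length] for s in starts])
    return np.sqrt(np.mean(frames ** 2, axis = 1))

sr = 32000  # assumed sampling rate
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 3000 * t)                    # 1 s synthetic "call"
signal = np.concatenate([np.zeros(sr), tone, np.zeros(sr)])  # silence, call, silence

rms = frame_rms(signal)
is_silent = rms < 0.01  # threshold picked by hand for this synthetic signal
print(f"{is_silent.mean():.0%} of the frames are silent")
```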

In [None]:
import librosa.display
import re

# sample
seed = 7
nsamples = 5
samples = train_meta.sample(nsamples, replace = False, random_state = seed)

# show
for index, sample in samples.iterrows():    
    print("filename: " + sample["filename"])
    insepct_recording(sample["filename"])
    print("----------------------------------------\n")

**Observations**
1. The files are usually noisy.
    * A lot of the noise is low-frequency, as in `belkin1/XC340666.ogg`, `comgal1/XC298133.ogg`, and `skylar/XC439352.ogg`. A high-pass filter can remove it.
    * There is also high-frequency noise, as in `lesyel/XC264114.ogg` and `buwtea/XC173321.ogg`. I think this noise comes from insects.
    * Some noise is spread across the whole spectrogram (white noise), as in `semplo/XC492111.ogg` and `arcter/XC582087.ogg`. I think this is because the bird is far away from the microphone.
    * There is even harmonic noise, as in `redjun/XC599861.ogg`.
    
2. Interesting recordings
    * `hawcre/XC122284.ogg`: this recording contains 4 species, and all of them are scored.
    * `sora/XC174354.ogg`: this recording seems to contain multiple species, but it has no secondary_labels. I suppose this happens quite commonly.
    * `norpin/XC627557.ogg`: I can't really hear the birds, but the rating of this recording is 4.
    * `leasan/XC545877.ogg`: It's like multiple little fires 🔥🔥🔥
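
Low-frequency noise of the kind noted above is usually attenuated with a high-pass filter. The following is a minimal sketch using SciPy's Butterworth design on a synthetic signal; the 1 kHz cutoff, the filter order, the sampling rate, and the helper `highpass` are assumptions for this example. A cutoff this high would also remove low-pitched calls, so it should be checked against the species of interest.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def highpass(signal, sr, cutoff_hz = 1000, order = 5):
    """Attenuate content below `cutoff_hz` with a zero-phase Butterworth filter."""
    sos = butter(order, cutoff_hz, btype = "highpass", fs = sr, output = "sos")
    return sosfiltfilt(sos, signal)

sr = 32000  # assumed sampling rate
t = np.arange(2 * sr) / sr
rumble = np.sin(2 * np.pi * 100 * t)        # stand-in for low-frequency noise
call = 0.5 * np.sin(2 * np.pi * 4000 * t)   # stand-in for a bird call
filtered = highpass(rumble + call, sr)

# away from the edges, the filtered signal should be close to the call alone
mid = slice(sr // 2, -sr // 2)
residual = np.sqrt(np.mean((filtered[mid] - call[mid]) ** 2))
print(f"residual RMS: {residual:.4f}")
```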

To inspect the recordings I listed above, just use the function `insepct_recording`.

In [None]:
signal, spec = insepct_recording("leasan/XC545877.ogg")

You can also pass a sample to it, but the sample must be a `pd.Series`.

In [None]:
assert type(train_meta.sample(1).iloc[0]) == pd.Series
signal, spec = insepct_recording(train_meta.sample(1).iloc[0])

<a id="melspectrogram"></a>
## 2.2 Mel spectrogram, MFCCs

A mel spectrogram uses a bank of filters to construct another form of spectrogram, which compresses the frequency axis at high frequencies and stretches it at low frequencies. This behavior is similar to how our ears perceive pitch.  
MFCCs apply a discrete cosine transform to the log mel spectrogram and keep only the first few coefficients, which gives a compact summary of the spectral envelope. 
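
To make the two ideas concrete, here is a small NumPy sketch. It uses the HTK-style mel formula, which is an assumption on my part (librosa uses the Slaney variant by default), along with an assumed 32 kHz sampling rate and a made-up frame of log band energies. It only illustrates how the band edges are spaced and how a DCT-II produces MFCC-like coefficients; it does not reproduce librosa's exact output.

```python
import numpy as np

def hz_to_mel(f):
    """HTK-style mel scale."""
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)

sr = 32000   # assumed sampling rate
n_mels = 8   # a tiny number of bands, for readability

# band edges equally spaced on the mel scale, mapped back to Hz
edges_hz = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2))
print(np.round(edges_hz))
print(np.round(np.diff(edges_hz)))  # the bands get wider as the frequency grows

# MFCC idea: an (unnormalized) DCT-II over the log band energies,
# keeping only the first few coefficients
log_energies = np.log(np.linspace(1.0, 2.0, n_mels))  # made-up log-mel frame
k = np.arange(n_mels)
dct_basis = np.cos(np.pi / n_mels * (k[:, None] + 0.5) * k[None, :])  # (band, coefficient)
mfcc_like = log_energies @ dct_basis[:, :4]
print(mfcc_like.shape)
```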

Since the code for these features is very similar, I wrote a function `plot_features`.  
It is called in the same way as `insepct_recording`.  
In addition, you can provide `resolutions`, a list of dictionaries.  
This is useful for inspecting the impact of the number of FFT bins, the window size, and the hop length, so that you can choose the best settings for your model.  
Each dictionary is passed as keyword arguments to `librosa.stft`, `librosa.feature.melspectrogram`, and `librosa.feature.mfcc`, so it should only contain arguments that all three accept (such as `n_fft`, `win_length`, and `hop_length`).  
Last but not least, it also returns `signal, spec, melspec, mfccs` **of the last resolution**.
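
To get a feeling for what a resolution setting implies before plotting, the window and hop durations, the width of a frequency bin, and the number of frames can be worked out by hand. The helper `describe_resolution` below is hypothetical, and it assumes a 32 kHz sampling rate, a 5 s clip, and librosa's defaults of `n_fft = 2048` and centered frames.

```python
def describe_resolution(sr, win_length, hop_length, n_fft = None, clip_seconds = 5):
    """Time/frequency resolution implied by one STFT setting."""
    n_fft = n_fft or 2048  # librosa's default n_fft
    return {
        "window_ms": 1000 * win_length / sr,
        "hop_ms": 1000 * hop_length / sr,
        "bin_width_hz": sr / n_fft,
        "n_freq_bins": 1 + n_fft // 2,
        "n_frames": 1 + (clip_seconds * sr) // hop_length,  # with centered frames
    }

sr = 32000  # assumed sampling rate
settings = [
    {"win_length": 2048, "hop_length": 512},
    {"win_length": 4096, "hop_length": 1024, "n_fft": 4096},
    {"win_length": int(25e-3 * sr), "hop_length": int(10e-3 * sr)},
]
for res in settings:
    print(res, describe_resolution(sr, **res))
```

Longer windows give narrower frequency bins but blur short events in time, which is the trade-off the plots make visible.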

In [None]:
sr = train_meta.iloc[0].sr

resolutions = [
        {"win_length": 2048, "hop_length": 512},
        {"win_length": 4096, "hop_length": 1024, "n_fft": 4096},
        {"win_length":int(25e-3 * sr), "hop_length":int(10e-3 * sr)},
        ]
signal, spec, melspec, mfccs = plot_features("leasan/XC545877.ogg", resolutions = resolutions)

In [None]:
signal, spec, melspec, mfccs = plot_features(train_meta.sample(1).iloc[0], resolutions = resolutions)

<a id="part-3"></a>
# Part 3 - Outliers
Sometimes there is not enough RAM and you want to subsample the data. Be careful with these outliers.
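
One way to subsample without losing rare species is to sample per species and always keep at least one recording. The sketch below shows this on a made-up table; the helper `sample_per_species`, the 20% fraction, and the toy file names are assumptions for illustration.

```python
import pandas as pd

def sample_per_species(meta, frac, seed = 0):
    """Sample `frac` of the recordings of each species, but never fewer than one."""
    parts = []
    for _, group in meta.groupby("primary_label"):
        n = max(1, round(len(group) * frac))
        parts.append(group.sample(n = n, random_state = seed))
    return pd.concat(parts)

# toy metadata: one common species and one species with a single recording
toy_meta = pd.DataFrame({
    "primary_label": ["common"] * 20 + ["rare"],
    "filename": [f"common/{i}.ogg" for i in range(20)] + ["rare/0.ogg"],
})

subset = sample_per_species(toy_meta, frac = 0.2)
print(subset["primary_label"].value_counts())
```

A plain `toy_meta.sample(frac = 0.2)` would drop the single `rare` recording in most draws, which is exactly the risk for a species like the one below.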

<a id="maupar"></a>
## 3.1 Maupar - a **scored bird** with only **one** recording, but it is long and complicated

<img src = "https://cdn.download.ams.birds.cornell.edu/api/v1/asset/96603681/1200" alt = "maupar" style = "zoom: 50%;" />

[img source](https://ebird.org/species/maupar)

In [None]:
maupar = train_meta[train_meta.primary_label == "maupar"]
print(f"number of samples: {maupar.shape[0]}; recording length: {maupar.time_in_sec.values[0]}s")

In [None]:
signal, spec = insepct_recording(maupar.iloc[0])

Notice that the secondary labels of this recording are also scored.