# 🐦 About the competition

 ### 🥅 Goal
 
 **Identify the bird species present in raw audio recorded in the wild.**
 
 ### 🧠 Understanding the given data
 
 **Training data:**
 * `train_short_audio`: Directory of short recordings of individual bird calls generously uploaded by users of [xenocanto.org](https://www.xeno-canto.org/). **This data will be used to train a fine-grained audio classifier**. The audio recordings are downsampled to 32kHz and are in [ogg format](https://en.wikipedia.org/wiki/Ogg).
 * `train_soundscapes`: Directory of audio files that are comparable to the data that we will encounter in the test set. **This is the raw audio that's mentioned in the goal**. The recordings are all roughly ten minutes long and in the ogg format. 
 * `train_metadata.csv`: Wide range of metadata provided for the training data. 
 * `train_soundscape_labels.csv`: This `csv` can be used to build the inference pipeline. 
 
**Testing data:**
 * `test_soundscapes`: Directory of recordings to be used for scoring. **During submission** there are approximately 80 recordings, each roughly 10 minutes long and in ogg audio format.
 * `test.csv`: Same tabular information as `train_soundscape_labels.csv`.
 
### Evaluation 

In the `submission.csv`, for each `row_id/time window`, you need to provide a space delimited list of the set of unique birds that made a call beginning or ending in that time window. If there are no bird calls in a time window, use the code `nocall`.

The submissions will be evaluated based on their row-wise micro averaged [F1 score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html).
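
To make the metric concrete, here is a minimal sketch of one plausible reading of the row-wise F1 score: compute F1 on the set of labels in each row, then average over rows. The helper names `row_f1` and `mean_row_f1` are hypothetical, and treating `nocall` as an ordinary label is an assumption, so the official scorer may differ in details.

```python
def row_f1(true_birds: str, pred_birds: str) -> float:
    """F1 score for a single row, given space-delimited label strings."""
    true_set = set(true_birds.split())
    pred_set = set(pred_birds.split())
    tp = len(true_set & pred_set)   # labels predicted and present
    fp = len(pred_set - true_set)   # labels predicted but absent
    fn = len(true_set - pred_set)   # labels present but missed
    denom = 2 * tp + fp + fn
    # Two empty rows are treated as a perfect match (assumption).
    return 2 * tp / denom if denom else 1.0


def mean_row_f1(true_rows, pred_rows) -> float:
    """Average the per-row F1 scores over all rows."""
    scores = [row_f1(t, p) for t, p in zip(true_rows, pred_rows)]
    return sum(scores) / len(scores)


# Toy example: `nocall` is handled like any other label here.
y_true = ["nocall", "rucwar bobfly1", "rucwar"]
y_pred = ["nocall", "rucwar", "bobfly1"]
score = mean_row_f1(y_true, y_pred)
print(round(score, 4))
```

In this toy example the first row is a perfect match, the second row finds one of two birds, and the third row predicts the wrong bird.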

### About Weights and Biases

Think of W&B like GitHub for machine learning models. With a few lines of code, save everything you need to debug, compare and reproduce your models — architecture, hyperparameters, model weights, GPU usage, and even datasets and predictions.

* Create an account on https://wandb.ai.
* Input your personal API token key to login (mine is added as [Kaggle Secrets](https://www.kaggle.com/product-feedback/114053))


In [None]:
%%capture 
# To get the latest version of W&B
!pip install wandb --upgrade
!pip install tensorflow-io

In [None]:
import tensorflow as tf
import tensorflow_io as tfio

import os
os.environ['WANDB_SILENT'] = "true"
import re
import gc
import glob
import wandb
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
%matplotlib inline

import IPython.display as ipd

# Map libraries
import descartes
import geopandas as gpd
from shapely.geometry import Point, Polygon

from kaggle_secrets import UserSecretsClient

# Audio specific imports
import librosa as lb
import librosa.display

# W&B login
user_secrets = UserSecretsClient()
wandb_api = user_secrets.get_secret("wandb_api")
wandb.login(key=wandb_api)

# 📁 1. The CSV Files

## `train_metadata.csv`

In [None]:
METADATA_FILE_PATH = '../input/birdclef-2021/train_metadata.csv'
metadata_df = pd.read_csv(METADATA_FILE_PATH)
metadata_df.head(5)

In [None]:
# Load the file as W&B artifact. 
run = wandb.init(project='birdclef', group='EDA')
artifact = wandb.Artifact('train-metadata', type='dataset')
artifact.add_file(METADATA_FILE_PATH)
run.log_artifact(artifact)
run.finish()

> 📌 `primary_label`: The primary bird sound that can be heard in the recording. <br>
> 📌 `secondary_label`: Additional bird sounds present in the recording. <br>
> 📌 `type`: The type of bird sound. <br>
> 📌 `latitude` and `longitude`: Location where the recording was done. <br>
> 📌 `scientific_name`: The scientific name of the bird species. <br>
> 📌 `common_name`: The common (everyday) name of the bird species. <br>
> 📌 `author`: Name of the person who uploaded (and possibly recorded) the audio. <br>
> 📌 `date`: The date the audio was recorded (possibly) and uploaded. <br>
> 📌 `filename`: The name of the audio file. <br>
> 📌 `license`: The license associated with that recording. <br>
> 📌 `rating`: The audio quality. <br>
> 📌 `time`: The time of the day the recording was uploaded. <br>
> 📌 `url`: The xenocanto.org url to d

#### What's the distribution of primary labels?

In [None]:
# Reference: https://www.kaggle.com/shahules/bird-watch-complete-eda-fe
# Unique eBird codes
species = metadata_df['primary_label'].value_counts()

# Make bar chart
fig = go.Figure(data=[go.Bar(y=species.values, x=species.index)],
                layout=go.Layout(margin=go.layout.Margin(l=0, r=0, b=10, t=50)))

# Show chart
fig.update_layout(title='Number of training samples per species')
fig.show()

#### What quality of audio recordings is available?

In [None]:
plt.figure(figsize=(16, 6))
ax = sns.countplot(x = metadata_df['rating'], palette="hls", order = metadata_df['rating'].value_counts().index)

plt.title("Sound quality rating", fontsize=16)
plt.ylabel("Count", fontsize=14)
plt.yticks(fontsize=13)
plt.xticks(rotation=45, fontsize=13)
plt.xlabel("");

> 📌 Thankfully most of the audio is high quality. 

#### Where are the birds located?

In [None]:
# Reference: https://www.kaggle.com/andradaolteanu/birdcall-recognition-eda-and-audio-fe
# SHP file
world_map = gpd.read_file("../input/world-shape-file/99bfd9e7-bb42-4728-87b5-07f8c8ac631c2020328-1-1vef4ev.lu5nk.shp")

# Coordinate reference system
crs = {"init" : "epsg:4326"}

# Lat and Long need to be of type float, not object
# .copy() avoids pandas' SettingWithCopyWarning when the columns are reassigned below
data = metadata_df[metadata_df["latitude"] != "Not specified"].copy()
data["latitude"] = data["latitude"].astype(float)
data["longitude"] = data["longitude"].astype(float)

# Create geometry
geometry = [Point(xy) for xy in zip(data["longitude"], data["latitude"])]

# Geo Dataframe
geo_df = gpd.GeoDataFrame(data, crs=crs, geometry=geometry)

# Create ID for species
species_id = geo_df["primary_label"].value_counts().reset_index()
species_id.insert(0, 'ID', range(0, 0 + len(species_id)))

species_id.columns = ["ID", "primary_label", "count"]

# Add ID to geo_df
geo_df = pd.merge(geo_df, species_id, how="left", on="primary_label")

# === PLOT ===
fig, ax = plt.subplots(figsize = (16, 10))
world_map.plot(ax=ax, alpha=0.4, color="grey")

palette = iter(sns.hls_palette(len(species_id)))

# Iterate over every species ID instead of a hard-coded count
for i in range(len(species_id)):
    geo_df[geo_df["ID"] == i].plot(ax=ax, markersize=20, color=next(palette), marker="o", label = "test");

> 📌 Most of the audio recordings come from North and South America. A fair share of recordings are from Europe.

#### When was the audio uploaded?

In [None]:
date_uploaded = metadata_df['date'].apply(lambda x: x.split('-')[0])

plt.figure(figsize=(16, 6))
ax = sns.countplot(x=date_uploaded.values, palette="hls")

plt.title("Number of Audio Files per Year", fontsize=16)
plt.xticks(rotation=90, fontsize=13)
plt.yticks(fontsize=13)
plt.ylabel("Frequency", fontsize=14)
plt.xlabel("");

> 📌 Most of the recordings were uploaded between 2012 and 2020. <br>
> 📌 Also note that some of the years are wrong (ex: 2104, 0199, etc).
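
A minimal sketch of how such implausible years could be flagged, assuming the dates are `yyyy-mm-dd` strings as in the cell above. The toy `dates` series and the `MIN_YEAR`/`MAX_YEAR` bounds are assumptions chosen for illustration, not values from the dataset documentation.

```python
import pandas as pd

# Toy dates mimicking the kinds of values seen in the `date` column.
dates = pd.Series(["2019-05-12", "2104-03-01", "0199-07-30", "2015-00-00"])

# Extract the year part and convert it to a number (invalid text becomes NaN).
years = pd.to_numeric(dates.str.split("-").str[0], errors="coerce")

# Assumed plausible range for recording years.
MIN_YEAR, MAX_YEAR = 1900, 2021
valid = years.between(MIN_YEAR, MAX_YEAR)

print(dates[valid].tolist())
```

Only the year is checked here, so a date with a placeholder month or day (such as `2015-00-00`) still passes the filter.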

## `train_soundscape_labels.csv`

In [None]:
TRAIN_SOUNDSCAPE = '../input/birdclef-2021/train_soundscape_labels.csv'
train_soundscape_df = pd.read_csv(TRAIN_SOUNDSCAPE)
train_soundscape_df.head()

Reference: https://www.kaggle.com/stefankahl/birdclef2021-exploring-the-data
> 📌 `row_id`: Unique identifier of a 5-second segment of each soundscape file. <br>
> 📌 `site`: Recording site of the soundscape data. In this competition, 4 different sites (COL = Colombia, COR = Costa Rica, SNE = Sierra Nevada, SSW = Sapsucker Woods) are included. <br>
> 📌 `audio_id`: Identifier used to reference audio recordings. Filenames contain the file ID, recording site and recording date (yyyymmdd). <br>
> 📌 `seconds`: End time of the 5-second segment for which this entry states the label. <br>
> 📌 `birds`: primary label (i.e., eBird code) of the audible species of this segment. “nocall” references a segment without any bird vocalization. Segments can have more than one bird, in that case, eBird codes are separated by space. “nocall” can never appear together with other codes.


#### Which bird vocalizations are most common?

In [None]:
train_soundscape_df['birds'].value_counts()

> 📌 Many segments are associated with `nocall`. `rucwar`, `bobfly1`, etc. are commonly found. <br>
> 📌 Some segments contain vocalizations from two or more birds. **We thus need to build a multi-label classifier**.
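
For a multi-label setup, each `birds` string can be turned into a multi-hot target vector. The sketch below assumes a hypothetical three-species class list and maps `nocall` to an all-zero vector, which is one common choice rather than a requirement of the competition.

```python
import numpy as np


def encode_birds(birds: str, classes: list) -> np.ndarray:
    """Convert a space-delimited label string into a multi-hot vector."""
    index = {code: i for i, code in enumerate(classes)}
    target = np.zeros(len(classes), dtype=np.float32)
    for code in birds.split():
        # `nocall` means no bird is present, so the vector stays all zeros.
        if code != "nocall":
            target[index[code]] = 1.0
    return target


# Hypothetical, shortened class list (the real one has 397 species).
classes = ["bobfly1", "caltow", "rucwar"]

print(encode_birds("rucwar bobfly1", classes))  # two active labels
print(encode_birds("nocall", classes))          # all zeros
```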

## `test.csv`

In [None]:
TEST = '../input/birdclef-2021/test.csv'
test_df = pd.read_csv(TEST)
test_df.head()

# 🔈 2.  The Audio Files

**TL;DR for `train_short_audio`**

**Usage**: To train a fine-grained multi-class (label) audio classifier. <br>
**Num species**: 397 <br>
**Num training samples**: 62874 <br>
**Class-imbalance**: Yes

#### Number of bird species

In [None]:
SHORT_TRAIN = '../input/birdclef-2021/train_short_audio/'
print(f'Number of unique bird species: {len(os.listdir(SHORT_TRAIN))}')

#### Number of recordings per label

In [None]:
recordings_per_label = {'species_id': [], 
                        'num_audio': []}

for label in os.listdir(SHORT_TRAIN):
    num_recordings = len(os.listdir(SHORT_TRAIN+label))
    recordings_per_label['species_id'].append(label)
    recordings_per_label['num_audio'].append(num_recordings)
        
recordings_per_label = pd.DataFrame.from_dict(recordings_per_label)

run = wandb.init(project='birdclef', group='EDA')
data = [[label, val] for (val, label) in sorted(zip(recordings_per_label.num_audio.values, recordings_per_label.species_id.values))[::-1]]
table = wandb.Table(data=data, columns = ["species_id", "num_audio"])
wandb.log({"recordings_per_label" : wandb.plot.bar(table, "species_id", "num_audio",
                               title="Number of recordings per label")})
run.finish()

# display W&B run page
run

(☝️ [W&B Run Page](https://wandb.ai/ayush-thakur/birdclef/runs/1grwh85p?workspace=user-ayush-thakur))
> ⚠️ Disclaimer: Since there are too many labels, the `species_id` axis (y-axis) looks cluttered. <br>
> 📌 Pro-tip 1: Scroll over the bar chart to look at the number of recordings per `species_id`. <br>
> 📌 Pro-tip 2: Click on the ✏️ (Edit panel) icon in the chart to visualize the expanded version of the bar chart. <br>
> 📌 Pro-tip 3: In the `recordings_per_label_table` you can sort the columns in ascending or descending order by clicking on the column name. 

There is significant class imbalance: species `crfpar` and `stvhum2` have only 8 audio files each, while 12 labels have 500 audio files each. Explore the W&B dashboard above for more insight.

#### Total number of training recordings

In [None]:
num_data = np.sum(recordings_per_label['num_audio'])
print(f'Number of training data: {num_data}')

### Listen to Audio

We will use Weights and Biases to visualize audio waveforms and listen to bird sounds.

In [None]:
NUM_AUDIO_PER_LABEL = 8 # the rarest species have 8 recordings, so sampling 8 never fails
audio_per_label = {}

for label in os.listdir(SHORT_TRAIN):
    # add new key (label) to dict
    audio_per_label[label] = []
    # get NUM_AUDIO_PER_LABEL audio filenames per label
    audio_filenames = random.sample(os.listdir(SHORT_TRAIN+label), NUM_AUDIO_PER_LABEL)
    # append path to that file
    for audio_filename in audio_filenames:
        audio_per_label[label].append(SHORT_TRAIN+label+'/'+audio_filename)
        
# We will use Weights and Biases to visualize audio waveforms and listen to bird sounds.
run = wandb.init(project='birdclef', group='EDA')
for label, audio_paths in audio_per_label.items():
    audio_arr = []
    sr_arr = []
    for audio_path in audio_paths:
        audio, sr = lb.load(audio_path)
        audio_arr.append(audio)
        sr_arr.append(sr)
        
    # log audio data for each label per step. 
    wandb.log({'audio-samples': [wandb.Audio(audio, caption=f'{label}', sample_rate=sr) 
                                             for audio, sr in zip(audio_arr, sr_arr)]})

run.finish()

# display W&B run page
run

(☝️ [W&B Run Page](https://wandb.ai/ayush-thakur/birdclef/runs/2713uqst?workspace=user-ayush-thakur))

> 📌 Note: The dashboard above has 8 audio samples for each label. <br>
> 📌 Pro-tip 1: Click on the ⚙️ in the `audio-samples` chart above. There will be a total of 397 steps, where each step represents a unique label. <br>
> 📌 Pro-tip 2: The audio will appear for the selected step (label). Each audio player's caption is the label name. 

With just 3 lines of code you can visualize the waveform of each audio clip as well as listen to it. **I highly recommend spending time listening to the sample audio.**

👀 Some quick observations:
* The audio duration varies from a few seconds to a few minutes.
* The audio is noisy, with the sound of rain, insects, humans talking, wind blowing, etc. 
* In some recordings the bird is far away, so its sound is faint. 

# 🏆 3. Audio Features & Preprocessing

## Audio Normalization

In [None]:
# A sample audio. Change species_id to select different label
sample_audio = audio_per_label['caltow'][3]
# Load unnormalized audio
audio, sr = lb.load(sample_audio)
# Normalize audio
norm_audio = librosa.util.normalize(audio)

print(f'The shape of audio: {audio.shape}; sampling rate: {sr}; audio duration: {audio.shape[0]/sr} s')

run = wandb.init(project='birdclef', group='EDA')
wandb.log({'audio_sample': [wandb.Audio(audio, caption=f'Audio Sample', sample_rate=sr)]})
wandb.log({'normalized_audio-sample': [wandb.Audio(norm_audio, caption=f'Normalized Audio Sample', sample_rate=sr)]})
run.finish()

run

> 📌 Note that this scales the signal so that its peak absolute amplitude is 1, i.e. the values lie between -1 and 1. I am not sure whether normalizing the audio like this is a good idea. Here's a short [reddit post](https://www.reddit.com/r/MachineLearning/comments/4ea0m7/audio_normalizationpreprocessing_before/). 
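
With its default settings, `librosa.util.normalize` divides the signal by its maximum absolute value. Here is a minimal NumPy sketch of that peak normalization. The helper name `peak_normalize` and the toy signal are illustrative, not part of the notebook.

```python
import numpy as np


def peak_normalize(signal: np.ndarray) -> np.ndarray:
    """Scale a signal so that its largest absolute value becomes 1."""
    peak = np.max(np.abs(signal))
    # An all-zero signal is returned unchanged to avoid division by zero.
    return signal / peak if peak > 0 else signal


signal = np.array([0.1, -0.5, 0.25])
normalized = peak_normalize(signal)
print(normalized)  # values now lie in [-1, 1]
```

Because only the scale changes, the sign of each sample is preserved and negative values stay negative.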

## Trim the noise

In [None]:
position = tfio.experimental.audio.trim(audio, axis=0, epsilon=0.1)
trimmed_audio = audio[position[0]:position[1]]

run = wandb.init(project='birdclef', group='EDA')
wandb.log({'Trimmed': [wandb.Audio(trimmed_audio, caption=f'Trimmed', sample_rate=sr)]})
run.finish()

run

> 📌 Note how the initial and final silent parts of the waveform are removed. This might be a useful preprocessing step after extracting 5-second clips.
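
To build intuition for what amplitude-based trimming does, here is a small NumPy sketch that finds the first and last samples whose absolute value exceeds a threshold. It is an assumption about the general idea, not a reproduction of the `tfio.experimental.audio.trim` implementation, and `trim_bounds` is a hypothetical helper.

```python
import numpy as np


def trim_bounds(signal: np.ndarray, epsilon: float):
    """Return (start, stop) indices enclosing all samples louder than epsilon."""
    loud = np.flatnonzero(np.abs(signal) > epsilon)
    if loud.size == 0:
        # Nothing exceeds the threshold: return an empty range.
        return 0, 0
    return int(loud[0]), int(loud[-1]) + 1


signal = np.array([0.0, 0.05, 0.3, -0.4, 0.02, 0.2, 0.01, 0.0])
start, stop = trim_bounds(signal, epsilon=0.1)
trimmed = signal[start:stop]
print(start, stop, trimmed)
```

Quiet samples between two loud ones are kept. Only the leading and trailing quiet parts are dropped.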

#### Let's see the effect of trimming on 5 second clips of the audio. 

In [None]:
audio_clips = []
audio_time = len(audio)//sr
print(f'The duration of audio is: {audio_time} s')

start_sample = 0
end_sample = sr*5 # sampling rate is number of samples per second. 

for i in range(audio_time//5):
    audio_clips.append(audio[start_sample:end_sample])
    start_sample = end_sample
    end_sample+=sr*5
    
run = wandb.init(project='birdclef', group='EDA')
for i, audio_clip in enumerate(audio_clips):
    # Trim audio
    position = tfio.experimental.audio.trim(audio_clip, axis=0, epsilon=0.1)
    trimmed_audio = audio_clip[position[0]:position[1]]
    
    # Log clipped and trimmed audio
    audio_arr = [audio_clip, trimmed_audio]
    captions = ['Clipped', 'Trimmed']
    
    wandb.log({f'clipped_vs_trimmed': [wandb.Audio(aud, caption=f'{caption}', sample_rate=sr)
                                              for aud, caption in zip(audio_arr, captions)]})

run.finish()

run

> 📌 Click on the ⚙️ in the `clipped_vs_trimmed` chart above. Each step is a unique pair of an audio clip and its trimmed counterpart. <br>
> 📌 Note: Trimming removes some of the background sound, but there is no control over what is lost. Thus it might not be a good strategy to apply to 5-second clips.
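
The fixed-length splitting done in the loop above can also be expressed as a single reshape. This is a sketch with a toy sampling rate and signal. `split_into_clips` is a hypothetical helper, and like the loop it drops any trailing samples that do not fill a complete clip.

```python
import numpy as np


def split_into_clips(signal: np.ndarray, sr: int, clip_seconds: int = 5) -> np.ndarray:
    """Split a 1-D signal into non-overlapping clips of clip_seconds each."""
    clip_len = sr * clip_seconds
    n_clips = len(signal) // clip_len
    # Drop the incomplete tail, then view the rest as (n_clips, clip_len).
    return signal[: n_clips * clip_len].reshape(n_clips, clip_len)


sr = 100  # toy sampling rate to keep the example small
signal = np.arange(1234, dtype=np.float32)
clips = split_into_clips(signal, sr)
print(clips.shape)
```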

## Fade In and Fade Out

In [None]:
fade = tfio.experimental.audio.fade(trimmed_audio, fade_in=100000, fade_out=200000, mode="exponential")

fig, ax = plt.subplots(2, figsize = (20, 8), dpi=120)
fig.suptitle('Trimmed vs Faded', fontsize=16)
# Note: waveplot was removed in librosa 0.10; newer versions provide lb.display.waveshow instead.
lb.display.waveplot(trimmed_audio, sr=sr, ax=ax[0])
lb.display.waveplot(fade.numpy(), sr=sr, ax=ax[1])
plt.show()

> 📌 The Wikipedia article on audio fades might be a good read. A recording may be gradually reduced to silence at its end (fade-out), or gradually raised from silence at its beginning (fade-in). Again, I am not sure whether this will be useful.
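
As an illustration of the idea, here is a NumPy sketch of a fade-in and fade-out using linear ramps. Note the swap: the cell above uses `mode="exponential"`, while this sketch uses a linear ramp for simplicity. `linear_fade` is a hypothetical helper that assumes both fade lengths fit within the signal.

```python
import numpy as np


def linear_fade(signal: np.ndarray, fade_in: int, fade_out: int) -> np.ndarray:
    """Apply a linear fade-in and fade-out, with lengths given in samples."""
    out = signal.astype(np.float64).copy()
    if fade_in > 0:
        # Gain ramps from 0 to 1 over the first fade_in samples.
        out[:fade_in] *= np.linspace(0.0, 1.0, fade_in)
    if fade_out > 0:
        # Gain ramps from 1 to 0 over the last fade_out samples.
        out[-fade_out:] *= np.linspace(1.0, 0.0, fade_out)
    return out


signal = np.ones(10)
faded = linear_fade(signal, fade_in=3, fade_out=5)
print(faded)
```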

## Fast Fourier Transform

In [None]:
# compute fft
fft = np.fft.fft(audio)
# compute frequency 
freq = np.fft.fftfreq(audio.size, 1/sr)
freq = freq[:len(freq)//2]
print(f'Max frequency in the audio: {freq[-1]}. Obviously it is going to be half of sampling rate')
# get the magnitude
mag_fft = abs(fft)
# For real-valued input, the 2nd half of the FFT mirrors the 1st half, so keep only the 1st half.
mag_fft = mag_fft[:len(mag_fft)//2]

fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(20, 8), dpi=120);
fig.suptitle('Fast Fourier Transform')
ax[0].plot(audio)
ax[0].set_xlabel('Time', fontsize=16);
ax[0].set_ylabel('Amplitude', fontsize=16);

ax[1].plot(freq, mag_fft)
ax[1].set_xlabel('Freq(Hz)', fontsize=16);
ax[1].set_ylabel('Magnitude', fontsize=16);

> 📌 Note that for a real-valued signal the relevant information is in the first half of the FFT output. The second half mirrors the first (as complex conjugates, in reverse index order). <br>
> 📌 The energy is concentrated in a few frequency bands, mostly in the higher-frequency region. <br>
> 📌 Note that FFT results computed with TensorFlow and NumPy differ slightly numerically. For training a neural network it might not be an issue.
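
The mirror property can be checked on a synthetic tone. The sketch below uses a toy 50 Hz sine wave (an assumption for illustration, not data from the competition) and also shows `np.fft.rfft`, which returns only the non-redundant half directly.

```python
import numpy as np

sr = 1000                          # toy sampling rate in Hz
t = np.arange(sr) / sr             # one second of time stamps
tone = np.sin(2 * np.pi * 50 * t)  # pure 50 Hz sine wave

full = np.fft.fft(tone)            # full complex spectrum (N values)
half = np.fft.rfft(tone)           # non-redundant half (N // 2 + 1 values)
freqs = np.fft.rfftfreq(tone.size, d=1 / sr)

# For real input, bin N-k is the complex conjugate of bin k.
mirrored = np.allclose(full[1:], np.conj(full[1:][::-1]))

# The strongest bin should sit at the frequency of the tone.
peak_hz = freqs[np.argmax(np.abs(half))]
print(mirrored, peak_hz, half.size)
```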

## Spectrogram

A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. 

### Linear Power Spectrogram

In [None]:
# Parameters
n_fft = 2048
hop_length = 512

In [None]:
# Short-time Fourier transform (STFT)
S = np.abs(lb.stft(audio, n_fft=n_fft, hop_length=hop_length))
# Convert an amplitude spectrogram to Decibels-scaled spectrogram.
S_to_DB = librosa.amplitude_to_db(S, ref=np.max)

# Plot spectrogram
plt.figure(figsize=(16,4))
lb.display.specshow(S_to_DB, y_axis='linear', sr=sr, hop_length=hop_length,
                   x_axis='time') # Note linear y axis
plt.title('Linear-Frequency Power Spectrogram');
plt.colorbar();

> 📌 `n_fft=2048` samples corresponds to a physical duration of about 93 milliseconds at a sample rate of 22050 Hz.
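
The arithmetic behind that number, and the resulting spectrogram shape, can be sketched as follows. The frame-count formula assumes librosa's default centered framing (`center=True`), and `n_frames` is a hypothetical helper.

```python
n_fft = 2048
hop_length = 512
sr = 22050  # librosa's default sampling rate when loading audio

# Duration covered by one analysis window and by one hop, in milliseconds.
window_ms = 1000 * n_fft / sr
hop_ms = 1000 * hop_length / sr

# Number of frequency bins returned for a real-valued signal.
n_bins = 1 + n_fft // 2


def n_frames(n_samples: int, hop_length: int) -> int:
    """Frame count under centered framing (assumes librosa's center=True)."""
    return 1 + n_samples // hop_length


frames_5s = n_frames(5 * sr, hop_length)
print(round(window_ms, 1), round(hop_ms, 1), n_bins, frames_5s)
```

Under these assumptions a 5-second clip gives a spectrogram of shape `(n_bins, frames_5s)`.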

### Log Power Spectrogram

In [None]:
# Short-time Fourier transform (STFT)
S = np.abs(lb.stft(audio, n_fft=n_fft, hop_length=hop_length))
# Convert an amplitude spectrogram to Decibels-scaled spectrogram.
S_to_DB = librosa.amplitude_to_db(S, ref=np.max)

# Plot spectrogram 
plt.figure(figsize=(16,4))
lb.display.specshow(S_to_DB, y_axis='log', sr=sr, hop_length=hop_length,
                   x_axis='time') # Note logarithmic yaxis
plt.title('Log-Frequency Power Spectrogram');
plt.colorbar();

### Constant Q (Note) Power Spectrogram

In [None]:
# Constant-Q transform (CQT)
CQT_note = np.abs(lb.cqt(audio, sr=sr, hop_length=hop_length))
# Convert an amplitude spectrogram to Decibels-scaled spectrogram.
CQT_note = librosa.amplitude_to_db(CQT_note, ref=np.max)

# Plot spectrogram
plt.figure(figsize=(16,4))
lb.display.specshow(CQT_note, y_axis='cqt_note', sr=sr, hop_length=hop_length,
                   x_axis='time') # Note cqt_note y axis
plt.title('Constant-Q (Note) Power Spectrogram');
plt.colorbar();

### Constant Q (Hz) Power Spectrogram

In [None]:
# Constant-Q transform (CQT)
CQT_note = np.abs(lb.cqt(audio, sr=sr, hop_length=hop_length))
# Convert an amplitude spectrogram to Decibels-scaled spectrogram.
CQT_note = librosa.amplitude_to_db(CQT_note, ref=np.max)

# Plot spectrogram 
plt.figure(figsize=(16,4))
lb.display.specshow(CQT_note, y_axis='cqt_hz', sr=sr, hop_length=hop_length,
                   x_axis='time') # Note cqt_hz y axis
plt.title('Constant-Q (Hz) Power Spectrogram');
plt.colorbar();

### Tempogram with BPM markers

In [None]:
Tgram = lb.feature.tempogram(y=audio, sr=sr)

# Plot tempogram
plt.figure(figsize=(16,4))
lb.display.specshow(Tgram, y_axis='tempo', sr=sr, hop_length=hop_length,
                   x_axis='time') # Note tempo y axis
plt.title('Tempogram with BPM Markers');
plt.colorbar();

### Chromagram with Pitch Classes

In [None]:
C = librosa.feature.chroma_cqt(y=audio, sr=sr)

# Plot chromagram
plt.figure(figsize=(16,4))
lb.display.specshow(C, y_axis='chroma', sr=sr, hop_length=hop_length,
                   x_axis='time') # Note chroma y axis
plt.title('Chromagram with Pitch Class');
plt.colorbar();

### Let's look at the spectrogram for every class using W&B.

**We will visualize the linear-frequency power spectrogram for audio with a rating of 4 or higher.**

In [None]:
def get_path(label, filename):
    return '../input/birdclef-2021/train_short_audio/'+f'{label}/{filename}'

# Keep only audio files with a rating of 4 or more.
# .copy() avoids pandas' SettingWithCopyWarning when adding a column below.
metadata_tmp_df = metadata_df.loc[metadata_df['rating'] >= 4].copy()
metadata_tmp_df['kaggle_path'] = metadata_tmp_df.apply(lambda row: get_path(row['primary_label'],
                                                                            row['filename']), axis=1)

# Get random samples for each label.
audio_per_label = {}

for label in os.listdir(SHORT_TRAIN):
    tmp = metadata_tmp_df.loc[metadata_tmp_df['primary_label'] == label]
    tmp = tmp.sample(n=1, axis=0, replace=True)
    audio_per_label[label] = tmp.kaggle_path.values[0]

In [None]:
# We will use Weights and Biases to visualize different spectrograms.
run = wandb.init(project='birdclef', group='EDA')
c = 0
for label, audio_path in audio_per_label.items():
    # Image name
    img_name = audio_path.split('/')[-1].split('.')[0]

    # Load audio
    audio, sr = lb.load(audio_path)

    # Compute spectrogram
    # Short-time Fourier transform (STFT)
    S = np.abs(lb.stft(audio, n_fft=n_fft, hop_length=hop_length))
    # Convert an amplitude spectrogram to Decibels-scaled spectrogram.
    S_to_DB = librosa.amplitude_to_db(S, ref=np.max)

    # Save as spectrogram
    fig = plt.figure(figsize=(16,4))
    lb.display.specshow(S_to_DB, y_axis='linear', sr=sr, hop_length=hop_length,
                       x_axis='time') # Note linear y axis
    plt.xticks([]); plt.yticks([]); plt.xlabel(''); plt.ylabel('');
    plt.savefig(f'{img_name}.png');
    plt.close(fig)

    # log audio data for each label per step. 
    wandb.log({"linear-power-spectrogram": [wandb.Image(f'{img_name}.png', caption=f'{label}')]})
    
    c+=1
    if c==20:
        break

run.finish()

# display W&B run page
run

>  📌 Click on the ⚙️ in the `linear-power-spectrogram` chart above to visualize the spectrogram per label. Note that I have only logged spectrograms for 20 labels.

#### Spectral Centroid

The spectral centroid indicates the frequency around which the energy of a spectrum is centered. In other words, it indicates where the "center of mass" of a sound is located.
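
As a rough illustration of the "center of mass" idea, the centroid of each frame can be computed as the magnitude-weighted mean of the bin frequencies. The NumPy sketch below uses a toy magnitude matrix. `spectral_centroid` here is a hypothetical standalone helper, not the librosa function used in the next cell.

```python
import numpy as np


def spectral_centroid(magnitudes: np.ndarray, freqs: np.ndarray) -> np.ndarray:
    """Magnitude-weighted mean frequency per frame.

    magnitudes has shape (n_bins, n_frames); freqs has shape (n_bins,).
    """
    total = magnitudes.sum(axis=0)
    # Guard against silent frames to avoid division by zero.
    total = np.where(total == 0, 1.0, total)
    return (freqs[:, None] * magnitudes).sum(axis=0) / total


freqs = np.array([0.0, 100.0, 200.0, 300.0])
magnitudes = np.array([
    [0.0, 1.0],
    [1.0, 1.0],
    [0.0, 1.0],
    [0.0, 1.0],
])
centroids = spectral_centroid(magnitudes, freqs)
print(centroids)
```

The first frame has all of its energy in the 100 Hz bin, while the second frame spreads energy evenly, so its centroid falls at the mean of the bin frequencies.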

In [None]:
sample_audio = audio_per_label['caltow']
audio, sr = lb.load(sample_audio)

S = np.abs(lb.stft(audio, n_fft=n_fft, hop_length=hop_length))
S_to_DB = librosa.amplitude_to_db(S, ref=np.max)

cent = lb.feature.spectral_centroid(S=S)
times = lb.times_like(cent)

fig, ax = plt.subplots(figsize=(16,4))
librosa.display.specshow(S_to_DB, y_axis='linear', x_axis='time', ax=ax)
ax.plot(times, cent.T, label='Spectral centroid', color='w')
ax.legend(loc='upper right');
ax.set(title='Linear-frequency power spectrogram with spectral centroid');

# WORK IN PROGRESS :D