<a href="https://www.kaggle.com/code/lonnieqin/bird-species-classification-with-efficientnet?scriptVersionId=122915722" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## Bird Species Classification with EfficientNet
## Table of Contents
* Overview
* Import Libraries
* Configuration
* Helper Functions
* Load data
* Exploratory Data Analysis
* Create TensorFlow Dataset
* Model Development
* Model Evaluation
* Create submission file
* Conclusion

## Overview
In this notebook, I build a bird species classification model from scratch. I train it on the [BirdCLEF 2023 competition dataset](https://www.kaggle.com/competitions/birdclef-2023), which contains 16941 audio files covering 264 bird species. This is an audio classification problem; one way to solve it is to convert the audio files to spectrogram images and build an image classifier. Here are the basic steps:
* Load and preprocess sound files using tensorflow-io.
* Randomly sample a 5-second clip from each sound file.
* Convert each clip to a spectrogram image with shape (256, 256, 1); the model repeats this channel three times to get a (256, 256, 3) image for the backbone.
* Create training and validation TensorFlow datasets.
* Create an image classification model with an EfficientNet backbone that accepts images with shape (n, 256, 256, 1) as input and outputs probabilities with shape (n, 264).
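
The random 5-second sampling step can be sketched in plain NumPy. This is only an illustration under assumptions: `random_crop` is a hypothetical helper that does not appear in this notebook, and the 32 kHz sample rate is the value hard-coded in the helper functions further down. The notebook itself does the equivalent with TensorFlow ops inside `preprocess`.

```python
import numpy as np

SAMPLE_RATE = 32000   # assumed sample rate, matching the notebook's helpers
CLIP_SECONDS = 5

def random_crop(waveform, rng, clip_len=SAMPLE_RATE * CLIP_SECONDS):
    """Return a random clip of at most clip_len samples (hypothetical helper)."""
    extra = len(waveform) - clip_len
    if extra <= 0:
        # Recording is not longer than the clip length: keep it whole.
        return waveform
    start = rng.integers(0, extra + 1)  # upper bound is exclusive, so +1
    return waveform[start:start + clip_len]

rng = np.random.default_rng(42)
long_audio = np.zeros(10 * SAMPLE_RATE, dtype=np.float32)   # 10 s of silence
short_audio = np.zeros(3 * SAMPLE_RATE, dtype=np.float32)   # 3 s of silence
long_clip = random_crop(long_audio, rng)
short_clip = random_crop(short_audio, rng)
print(long_clip.shape, short_clip.shape)  # (160000,) (96000,)
```

Recordings shorter than 5 seconds come back unchanged, so the later resize to a fixed image size is what brings every clip to the same shape.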


## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import tensorflow_io as tfio
from IPython.display import Audio
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import sklearn.metrics
import json
import tensorflow as tf
import os
import glob

## Configuration

In [None]:
class CFG:
    image_size = [256, 256]
    is_training = False
    epochs = 10

## Helper Functions

In [None]:
def padded_cmap(solution, submission, padding_factor=5):
    """Macro-averaged average precision, computed after padding with all-positive rows."""
    solution = solution.drop(['row_id'], axis=1, errors='ignore')
    submission = submission.drop(['row_id'], axis=1, errors='ignore')
    # Append padding_factor rows of ones to both frames so that every class
    # has at least one positive sample.
    new_rows = pd.DataFrame(
        [[1] * len(solution.columns) for _ in range(padding_factor)],
        columns=solution.columns,
    )
    padded_solution = pd.concat([solution, new_rows]).reset_index(drop=True).copy()
    padded_submission = pd.concat([submission, new_rows]).reset_index(drop=True).copy()
    score = sklearn.metrics.average_precision_score(
        padded_solution.values,
        padded_submission.values,
        average='macro',
    )
    return score


def preprocess(audio_url, label):
    """Load an audio file and return a random 5-second clip as a spectrogram image."""
    audio_string = tf.io.read_file(audio_url)
    audio = tfio.audio.decode_vorbis(audio_string)
    audio_tensor = tf.squeeze(audio, axis=[-1])
    # Pick a random start so that a 5-second window (at 32 kHz) fits in the recording;
    # recordings shorter than 5 seconds are used whole.
    diff = tf.cast(tf.shape(audio_tensor)[0] - 5 * 32000, tf.float32)
    begin = tf.cast(tf.random.uniform(shape=()) * diff, tf.int32)
    start_position = tf.where(diff > 0, begin, 0)
    end_position = tf.where(diff > 0, start_position + 5 * 32000, tf.shape(audio_tensor)[0])
    audio_tensor = audio_tensor[start_position:end_position]
    tensor = tf.cast(audio_tensor, tf.float32) / 32768.0
    # Spectrogram -> decibel scale -> add channel axis -> resize -> min-max scale to [0, 255].
    spectrogram = tfio.audio.spectrogram(tensor, nfft=512, window=512, stride=256)
    spectrogram = tfio.audio.dbscale(spectrogram, top_db=80)
    spectrogram = tf.expand_dims(spectrogram, axis=-1)
    spectrogram = tf.image.resize(spectrogram, CFG.image_size)
    spectrogram = (spectrogram - tf.reduce_min(spectrogram)) / (tf.reduce_max(spectrogram) - tf.reduce_min(spectrogram)) * 255.0
    return spectrogram, label

def make_dataset(df, batch_size=128, shuffle=True):
    ds = tf.data.Dataset.from_tensor_slices((df["file_path"], df["label"]))
    ds = ds.map(preprocess)
    if shuffle:
        ds = ds.shuffle(batch_size * 4)
    ds = ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
    return ds

def make_inference(tensor):
    image = preprocess_test(tensor)
    return model.predict(image)

def frame_audio(
        audio_array: np.ndarray,
        window_size_s: float = 5.0,
        hop_size_s: float = 5.0,
        sample_rate: int = 32000,
        ) -> np.ndarray:
    """Helper function for framing audio for inference, using tf.signal."""
    if window_size_s is None or window_size_s < 0:
        return audio_array[np.newaxis, :]
    frame_length = int(window_size_s * sample_rate)
    hop_length = int(hop_size_s * sample_rate)
    # pad_end=True zero-pads the last window so that no audio is dropped.
    framed_audio = tf.signal.frame(audio_array, frame_length, hop_length, pad_end=True)
    return framed_audio

def ensure_sample_rate(waveform, original_sample_rate,
                       desired_sample_rate=32000):
    """Resample waveform if required."""
    if original_sample_rate != desired_sample_rate:
        waveform = tfio.audio.resample(waveform, original_sample_rate, desired_sample_rate)
    return desired_sample_rate, waveform

def preprocess_test(audio_tensor):
    """Convert a batch of waveforms with shape (n, samples) to spectrogram images."""
    tensor = tf.cast(audio_tensor, tf.float32) / 32768.0
    spectrogram = tfio.audio.spectrogram(tensor, nfft=512, window=512, stride=256)
    spectrogram = tfio.audio.dbscale(spectrogram, top_db=80)
    spectrogram = tf.expand_dims(spectrogram, axis=-1)
    spectrogram = tf.image.resize(spectrogram, CFG.image_size)
    spectrogram = (spectrogram - tf.reduce_min(spectrogram)) / (tf.reduce_max(spectrogram) - tf.reduce_min(spectrogram)) * 255.0
    return spectrogram

def predict_for_sample(filename, sample_submission, frame_limit_secs=None):
    """Write per-window predictions for one soundscape into sample_submission.

    Relies on the globals `model` and `labels`, which are defined in later cells.
    """
    file_id = filename.split(".ogg")[0].split("/")[-1]
    audio = tfio.audio.AudioIOTensor(filename)
    sample_rate = audio.rate.numpy()
    audio_tensor = tf.squeeze(audio[0:], axis=[-1])
    sample_rate, wav_data = ensure_sample_rate(audio_tensor, sample_rate)
    fixed_tm = frame_audio(wav_data)  # shape: (num_windows, 5 * 32000)
    frame = 5  # end time of the current window, in seconds
    all_logits = make_inference(fixed_tm[:1])
    for window in fixed_tm[1:]:
        if frame_limit_secs and frame > frame_limit_secs:
            break  # every later window is past the limit as well
        logits = make_inference(window[np.newaxis, :])
        all_logits = np.concatenate([all_logits, logits], axis=0)
        frame += 5
    frame = 5
    for frame_logits in all_logits:
        # Note: the model defined in this notebook already ends in a softmax layer,
        # so this second softmax rescales probabilities rather than converting logits.
        probabilities = tf.nn.softmax(frame_logits).numpy()
        # Set the appropriate row in the sample submission.
        sample_submission.loc[sample_submission.row_id == file_id + "_" + str(frame), labels] = probabilities
        frame += 5

## Load data

In [None]:
train = pd.read_csv("../input/birdclef-2023/train_metadata.csv")
train.head()

In [None]:
submission = pd.read_csv("../input/birdclef-2023/sample_submission.csv")
submission.head()

In [None]:
labels = list(submission.columns)
labels.remove("row_id")
print(labels)

## Exploratory Data Analysis

There are 264 bird species, and some of them have only one sample, which makes it hard to design a cross-validation strategy. Until I find a better CV strategy, I use a simple train/validation split with random seed 42.
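
One possible way to cope with single-sample classes is to split each class separately and keep singletons in the training set. The sketch below uses only the standard library. `per_class_split` is a hypothetical helper and is not what this notebook does; the notebook uses a plain random split.

```python
import random
from collections import defaultdict

def per_class_split(labels, valid_fraction=0.2, seed=42):
    """Split sample indices class by class (hypothetical helper, for illustration).

    Classes with a single sample stay entirely in the training set.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        if len(indices) < 2:
            n_valid = 0  # nothing to spare for validation
        else:
            n_valid = max(1, int(len(indices) * valid_fraction))
            n_valid = min(n_valid, len(indices) - 1)  # keep one for training
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)

toy_labels = ["a"] * 10 + ["b"] * 5 + ["c"]   # "c" has a single sample
train_idx, valid_idx = per_class_split(toy_labels)
print(len(train_idx), len(valid_idx))
```

With this scheme every class is seen during training, but single-sample classes never contribute to the validation score, so the CV estimate says nothing about them.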

In [None]:
train.primary_label.value_counts()

In [None]:
train.secondary_labels.value_counts()

In [None]:
train["label"] = train["primary_label"].map(lambda primary_label: labels.index(primary_label))
train.head()

In [None]:
train["file_path"] = train["filename"].apply(lambda filename: os.path.join("/kaggle/input/birdclef-2023/train_audio", filename))
train.head()

### Number Of Samples

In [None]:
len(train)

### Create Audio Tensor
Let's create an Audio Tensor and play the sound.

In [None]:
audio = tfio.audio.AudioIOTensor("/kaggle/input/birdclef-2023/train_audio/blakit1/XC115289.ogg")
audio_tensor = tf.squeeze(audio[0:], axis=[-1])
Audio(audio_tensor.numpy(), rate=audio.rate.numpy())

Plot the waveform of this audio clip.

In [None]:
tensor = tf.cast(audio_tensor, tf.float32) / 32768.0
plt.figure()
plt.plot(tensor.numpy())

Show this audio clip as a spectrogram.

In [None]:
# Convert to spectrogram
tensor = tf.cast(audio_tensor, tf.float32) 
spectrogram = tfio.audio.spectrogram(tensor, nfft=512, window=512, stride=256)
spectrogram = tf.math.log(spectrogram)
plt.imshow(spectrogram)

The spectrogram of a 5-second audio clip has a shape of about (625, 257). For simplicity, I use (256, 256) as the input shape of the image classification model.
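
The (625, 257) figure follows from the spectrogram parameters used in this notebook (`nfft=512`, `stride=256`) and a 32 kHz sample rate. The arithmetic below is a rough sketch; `spectrogram_shape` is a hypothetical helper, and it assumes the last partial frame is padded rather than dropped.

```python
import math

def spectrogram_shape(n_samples, nfft=512, stride=256):
    """Expected (frames, frequency bins) of a short-time Fourier transform."""
    frames = math.ceil(n_samples / stride)   # assumes the final frame is end-padded
    bins = nfft // 2 + 1                     # one-sided FFT output size
    return frames, bins

shape = spectrogram_shape(5 * 32000)
print(shape)  # (625, 257)
```

Resizing this to (256, 256) shrinks the time axis by more than half while leaving the frequency axis almost unchanged.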

In [None]:
# Convert to spectrogram
spectrogram = tfio.audio.spectrogram(tensor[0:audio.rate * 5], nfft=512, window=512, stride=256)
spectrogram = tfio.audio.dbscale(spectrogram, top_db=80)

spectrogram = (spectrogram - tf.reduce_min(spectrogram)) / (tf.reduce_max(spectrogram) - tf.reduce_min(spectrogram)) * 255.0
plt.figure()
plt.imshow(spectrogram.numpy())

## Create TensorFlow Dataset

In [None]:
train_df, valid_df = train_test_split(train, test_size=0.2, shuffle=True, random_state=42)

In [None]:
train_df.head()

In [None]:
valid_df.head()

In [None]:
valid_ds = make_dataset(valid_df, shuffle=False)

The input and target shapes of the training data are (n, 256, 256, 1) and (n,). The targets stay as integer class indices: the sparse categorical cross-entropy loss used during training treats each index as if it were a one-hot vector over the 264 classes.
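
The relationship between integer labels and one-hot targets can be checked with a small NumPy sketch. The labels and the uniform predictions below are made-up toy values, not notebook data.

```python
import numpy as np

num_classes = 264
labels = np.array([0, 5, 263])                            # integer class indices, shape (n,)
one_hot = np.eye(num_classes, dtype=np.float32)[labels]   # shape (n, 264)

# Uniform toy predictions: every class gets probability 1/264.
probs = np.full((len(labels), num_classes), 1.0 / num_classes)

# Sparse form: pick the probability of the true class by index.
sparse_loss = -np.log(probs[np.arange(len(labels)), labels])
# One-hot form: sum over classes, weighted by the one-hot target.
dense_loss = -(one_hot * np.log(probs)).sum(axis=1)

print(one_hot.shape, np.allclose(sparse_loss, dense_loss))  # (3, 264) True
```

Both forms give the same cross-entropy, so keeping integer labels avoids building (n, 264) target tensors.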

In [None]:
for X, y in valid_ds.take(1):
    print(X.shape, y.shape)

## Model Development

In [None]:
if CFG.is_training:
    train_ds = make_dataset(train_df)
    def get_model():
        inputs = tf.keras.Input(shape=(CFG.image_size[0], CFG.image_size[1], 1))
        # Repeat the single spectrogram channel three times for the backbone.
        image_inputs = tf.concat([
            inputs,
            inputs,
            inputs
        ], axis=-1)
        vector = efficient_net(image_inputs)
        output = tf.keras.layers.Dense(264, activation="softmax")(vector)
        model = tf.keras.Model(inputs=inputs, outputs=output)
        model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(), optimizer=tf.keras.optimizers.Adam(1e-3), metrics=["accuracy"])
        return model
    # Frozen EfficientNetV2-S backbone, used as a feature extractor.
    efficient_net = tf.keras.applications.EfficientNetV2S(include_top=False, pooling="max")
    efficient_net.trainable = False
    efficient_net.summary()
    model = get_model()
    callbacks = [
        tf.keras.callbacks.ModelCheckpoint(
            "model.h5", 
            save_best_only=True
        ),
        tf.keras.callbacks.EarlyStopping(
            min_delta=1e-4, 
            patience=10
        ),
        tf.keras.callbacks.ReduceLROnPlateau(
            factor=0.3,
            patience=2, 
            min_lr=1e-7
        ),
        tf.keras.callbacks.TerminateOnNaN()
    ]
    model.fit(train_ds, epochs=CFG.epochs, validation_data=valid_ds, callbacks=callbacks)
else:
    model = tf.keras.models.load_model("/kaggle/input/bird-clef/model.h5")
model.summary()
tf.keras.utils.plot_model(model, show_shapes=True)

## Model Evaluation

In [None]:
y_preds = model.predict(valid_ds)
y_pred_labels = np.argmax(y_preds, axis=1)

In [None]:
submission_df = pd.DataFrame({"row_id": valid_df.index}).copy()
for i, column in enumerate(labels):
    submission_df[column] = y_preds[:, i]
true_labels = list(valid_df["label"])
solution_df = pd.DataFrame({"row_id": valid_df.index}).copy()
for column in labels:
    solution_df[column] = 0
for i in range(len(valid_df)):
    # secondary_labels is stored as a string holding a Python-style list of species codes.
    secondary_labels = valid_df.iloc[i]["secondary_labels"]
    secondary_labels = secondary_labels.replace("\'", "\"")
    arr = json.loads(secondary_labels)
    solution_df.loc[i, labels[true_labels[i]]] = 1
    for secondary_label in arr:
        # Mark each secondary species that is one of the target classes.
        if secondary_label in labels:
            solution_df.loc[i, secondary_label] = 1
score = padded_cmap(solution_df, submission_df)
print(f"CV:{score}")

## Create submission file

In [None]:
test_samples = list(glob.glob("/kaggle/input/birdclef-2023/test_soundscapes/*.ogg"))
submission = pd.read_csv("../input/birdclef-2023/sample_submission.csv")
submission[labels] = submission[labels].astype(np.float32)
for filename in test_samples:
    predict_for_sample(filename, submission, frame_limit_secs=15)
submission.to_csv("submission.csv", index=False)
submission.head()

## Conclusion
This model achieves about 0.58 CV and 0.71 LB, a little lower than the [baseline notebook](https://www.kaggle.com/code/philculliton/inferring-birds-with-kaggle-models), which is good enough for a notebook written from scratch. There is still a lot of room for improvement. For example:
* Create a better cross-validation strategy.
* Find a better way to create spectrogram images.
* Use a better sampling method.
* Use a better neural architecture and a better pretrained model.