# <p style="background-color:#f3ab60;font-family:newtimeroman;color:#662e2e;font-size:130%;text-align:center;border-radius:40px 40px;">BirdCLEF 2022</p>

<h1 align='center'>Introduction 📝</h1>
The goal of the competition is to identify which birds are calling in the recordings. This notebook is aimed at beginners who have little or no knowledge of this domain. In this kernel I will briefly go through the metadata and the audio data with some quick EDA. Then I will focus on the main part of the kernel: data processing and training an audio-based model using PyTorch.

This notebook is a work in progress, and I will keep updating it as I learn more (this is also my first time participating in an audio-based competition 😅).

##  <font color="red"> Please upvote if you find this kernel useful.</font>

<h1 align='center'>Table of Contents 📜</h1>
<ul style="list-style-type:square">
    <li><a href="#1">Importing Libraries</a></li>
    <li><a href="#2">Reading the data</a></li>
    <li><a href="#3">Quick EDA</a></li>
    <ul style="list-style-type:disc">
        <li><a href="#3.1">Train_Metadata</a></li>
        <li><a href="#3.2">Audio Files</a></li>
    </ul>
    <li><a href="#4">Dataset Preprocessing</a></li>
    <li><a href="#5">Model</a></li>
    <li><a href="#6">Utility Functions</a></li>
    <li><a href="#7">Training</a></li>
</ul>



<a id='1'></a>
# Importing Libraries 📚

In [None]:
import os
import gc
import ast
import random
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from tqdm import tqdm
import torchaudio
import IPython.display as ipd
from collections import Counter
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
from sklearn.utils import class_weight

import torch
import torch.nn as nn
from torch.optim import Adam
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torchvision import models

import warnings
warnings.filterwarnings('ignore')

In [None]:
class config:
    seed=2022
    num_fold = 9
    sample_rate= 32_000
    n_fft=1024
    hop_length=512
    n_mels=64
    duration=5
    num_classes = 152
    train_batch_size = 32
    valid_batch_size = 64
    model_name = 'resnet50'
    epochs = 2
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    learning_rate = 1e-3

In [None]:
def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
seed_everything(config.seed)

<a id='2'></a>
# Reading the data 📖

In [None]:
df = pd.read_csv('../input/birdclef-2022/train_metadata.csv')
df.head()

In [None]:
df.info()

In [None]:
df.describe()

<a id='3'></a>
# Quick EDA 📊

<a id='3.1'></a>
## Analyse Train_Metadata

In [None]:
plt.figure(figsize=(20, 6))

sns.countplot(x=df['primary_label'])  # pass the column as x; recent seaborn treats the first positional argument as `data`
plt.xticks(rotation=90)
plt.title("Distribution of Primary Labels", fontsize=20)

plt.show()

In [None]:
plt.figure(figsize=(20, 6))

sns.countplot(x=df['rating'])  # pass the column as x; recent seaborn treats the first positional argument as `data`
plt.title("Distribution of Ratings", fontsize=20)

plt.show()

In [None]:
df['type'] = df['type'].apply(lambda x : ast.literal_eval(x))

top = Counter([typ.lower() for lst in df['type'] for typ in lst])

top = dict(top.most_common(10))

plt.figure(figsize=(20, 6))

sns.barplot(x=list(top.keys()), y=list(top.values()), palette='hls')
plt.title("Top 10 song types")

plt.show()

<a id='3.2'></a>
## Analyse Audio Files

### Let's listen to a few audio clips

In [None]:
filename_1 = df["filename"].values[0] # first training example
ipd.Audio(f"../input/birdclef-2022/train_audio/{filename_1}")

In [None]:
filename_2 = df["filename"].values[-1] # last training example
ipd.Audio(f"../input/birdclef-2022/train_audio/{filename_2}")

### Now let us load the audio and plot the waveform.
<b>Note - I will be using torchaudio (an audio library for PyTorch) to process the audio data.</b><br>
<center>
<img src = "https://torch.mlverse.org/css/images/hex/torchaudio.png" style="width:200px;height:200px"><br>
<a href="https://pytorch.org/audio/stable/index.html">TORCHAUDIO DOCUMENTATION</a>    
</center>


In [None]:
fig, ax = plt.subplots(2, 1, figsize=(20, 10))
fig.suptitle("Sound Waves", fontsize=15)

signal_1, sr = torchaudio.load(f"../input/birdclef-2022/train_audio/{filename_1}")
# The audio data consist of two things-
# Sound: sequence of vibrations in varying pressure strengths (y)
# Sample Rate: (sr) is the number of samples of audio carried per second, measured in Hz or kHz

sns.lineplot(x=np.arange(len(signal_1[0,:].detach().numpy())), y=signal_1[0,:].detach().numpy(), ax=ax[0], color='#4400FF')
ax[0].set_title("Audio 1")

signal_2, sr = torchaudio.load(f"../input/birdclef-2022/train_audio/{filename_2}")
sns.lineplot(x=np.arange(len(signal_2[0,:].detach().numpy())), y=signal_2[0,:].detach().numpy(), ax=ax[1], color='#4400FF')
ax[1].set_title("Audio 2")

plt.show()

<a id='4'></a>
# Dataset Preprocessing 🛠️

### First of all, our target variable is a string, so we have to convert it to an integer. Here I have used LabelEncoder for this.

In [None]:
encoder = LabelEncoder()
df['primary_label_encoded'] = encoder.fit_transform(df['primary_label'])

# Encoded labels as a NumPy array of integer class indices
y = df['primary_label_encoded'].values

# 'balanced' weights are inversely proportional to how often each class occurs
class_weights = class_weight.compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
class_weights = torch.tensor(class_weights, dtype=torch.float).to(config.device)
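
To make the two steps above concrete, here is a minimal, dependency-free sketch of what they compute. It assumes toy labels, and the names `encode_labels`, `balanced_weights` and `toy_labels` are made up for illustration. `LabelEncoder` assigns integer ids in sorted order of the class names, and the `'balanced'` option weights each class by `n_samples / (n_classes * class_count)`, so rare birds get larger weights.

```python
from collections import Counter

def encode_labels(labels):
    """Map each distinct label to an integer id, in sorted order of the names."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return classes, [index[name] for name in labels]

def balanced_weights(encoded, n_classes):
    """Weight per class: n_samples / (n_classes * number of samples in that class)."""
    counts = Counter(encoded)
    n_samples = len(encoded)
    return [n_samples / (n_classes * counts[c]) for c in range(n_classes)]

toy_labels = ["bird_b", "bird_a", "bird_b", "bird_b"]  # hypothetical labels, not real data
classes, encoded = encode_labels(toy_labels)
weights = balanced_weights(encoded, len(classes))
print(classes, encoded, weights)
```

The weight list is indexed by the encoded label, which is also how `nn.CrossEntropyLoss(weight=...)` looks up the weight for each target.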

In [None]:
# Use the encoder's own class order, so that bird_labels[i] is the label encoded as i
bird_labels = encoder.classes_


### Next we create the folds.

In [None]:
skf = StratifiedKFold(n_splits=config.num_fold)
for k, (_, val_ind) in enumerate(skf.split(X=df, y=df['primary_label_encoded'])):
    df.loc[val_ind, 'fold'] = k

Now we will focus on our input variable. Our inputs are audio files, which the model cannot consume directly. To use them, we convert them into a suitable representation with a feature extraction technique.

## Feature Extraction
There are several feature extraction techniques in audio processing, but I will not cover all of them in this notebook. Generally, the extracted features take the form of images, which we then use to train our model. <br>
I would recommend this playlist on audio processing to understand the basics - https://www.youtube.com/playlist?list=PL-wATfeyAMNqIee7cH3q1bh4QJFAaeNv0 <br>
Here I will be extracting a mel spectrogram, which is a type of spectrogram where the frequencies are converted to the mel scale.
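
As a rough illustration of what "converted to the mel scale" means, the sketch below uses the common HTK-style formula `m = 2595 * log10(1 + f / 700)`. Which variant is in play is an assumption on my part: I believe this matches torchaudio's default (`mel_scale='htk'`), but some libraries default to a different (Slaney) variant. The function names are made up for illustration.

```python
import math

def hz_to_mel(freq_hz):
    """HTK-style conversion from frequency in Hz to mels."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Equal steps in Hz shrink on the mel scale as the frequency grows
for f in (500, 1000, 4000, 16000):
    print(f"{f:>6} Hz -> {hz_to_mel(f):8.1f} mel")
```

The scale is roughly linear at low frequencies and logarithmic at high ones, so a mel spectrogram spends more of its bins on the lower part of the spectrum.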

### Now let us look at the Mel Spectrogram for the audio loaded during the EDA.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(20, 7))
fig.suptitle("Mel Spectrogram", fontsize=15)

mel_spectrogram = torchaudio.transforms.MelSpectrogram(sample_rate=config.sample_rate, 
                                                      n_fft=config.n_fft, 
                                                      hop_length=config.hop_length, 
                                                      n_mels=config.n_mels)

mel_1 = mel_spectrogram(signal_1)
ax[0].imshow(mel_1.log2()[0,:,:].detach().numpy(), aspect='auto', cmap='cool')
ax[0].set_title("Audio 1")

mel_2 = mel_spectrogram(signal_2)
ax[1].imshow(mel_2.log2()[0,:,:].detach().numpy(), aspect='auto', cmap='cool')
ax[1].set_title("Audio 2")

plt.show()

### Similarly, we will extract a mel spectrogram for each audio file and train the model on them. But wait, this is not the end. There are several things we need to consider before extracting spectrograms from the audio files. We want our dataset to be uniform, and to achieve that we should consider the points below:
* As I mentioned above, the audio data consists of two things: the sample rate and the sound. Not all audio files have the same sample rate, which is a big problem if we want the extracted mel spectrograms to be uniform. So we resample the audio so that every file has the same sample rate.
* Next, the sound has dimensions (num_channels, num_samples). Different audio signals can have different numbers of channels, so we make sure they are all mono, i.e., num_channels = 1.
* Lastly, the audio signals have different durations, which leads to different numbers of samples. So we enforce the same number of samples by padding a signal that is shorter than the desired length and truncating one that is longer.
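
The last two points can be sketched without any audio library. This is only a toy illustration that uses plain Python lists in place of tensors and leaves resampling to torchaudio; the helper names `to_mono` and `fix_length` are hypothetical.

```python
def to_mono(channels):
    """Average a list of equal-length channels into a single channel."""
    n = len(channels)
    return [sum(frame) / n for frame in zip(*channels)]

def fix_length(samples, num_samples):
    """Truncate, or right-pad with zeros, so the result has exactly num_samples items."""
    if len(samples) >= num_samples:
        return samples[:num_samples]
    return samples + [0.0] * (num_samples - len(samples))

stereo = [[0.2, 0.4, 0.6], [0.0, 0.2, 0.2]]  # toy (num_channels, num_samples) signal
mono = to_mono(stereo)
print(fix_length(mono, 5))  # shorter than the target: zeros are appended
print(fix_length(mono, 2))  # longer than the target: the tail is dropped
```

With the values used in this notebook, the target length would be `sample_rate * duration = 32000 * 5 = 160000` samples.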

### Now I will implement the custom Dataset class, which applies all of the points above.

In [None]:
class BirdClefDataset(Dataset):
    def __init__(self, df, transformation, target_sample_rate, duration):
        self.audio_paths = df['filename'].values
        self.labels = df['primary_label_encoded'].values
        self.transformation = transformation
        self.target_sample_rate = target_sample_rate
        self.num_samples = target_sample_rate*duration
        
    def __len__(self):
        return len(self.audio_paths)
    
    def __getitem__(self, index):
        audio_path = f'../input/birdclef-2022/train_audio/{self.audio_paths[index]}'
        signal, sr = torchaudio.load(audio_path) # loaded the audio
        
        # First we check whether the sample rate equals the target sample rate; if it does not, we resample
        if sr != self.target_sample_rate:
            resampler = torchaudio.transforms.Resample(sr, self.target_sample_rate)
            signal = resampler(signal)
        
        # Next we check the number of channels of the signal
        #signal -> (num_channels, num_samples) - Eg.-(2, 14000) -> (1, 14000)
        if signal.shape[0]>1:
            signal = torch.mean(signal, axis=0, keepdim=True)
        
        # Lastly we check the number of samples of the signal
        #signal -> (num_channels, num_samples) - Eg.-(1, 14000) -> (1, self.num_samples)
        # If it is more than the required number of samples, we truncate the signal
        if signal.shape[1] > self.num_samples:
            signal = signal[:, :self.num_samples]
        
        # If it is less than the required number of samples, we pad the signal
        if signal.shape[1]<self.num_samples:
            num_missing_samples = self.num_samples - signal.shape[1]
            last_dim_padding = (0, num_missing_samples)
            signal = F.pad(signal, last_dim_padding)
        
        # All preprocessing is done, so now we extract the mel spectrogram from the signal
        mel = self.transformation(signal)
        
        # Pretrained models expect a 3-channel image, so we stack the mel spectrogram three times
        image = torch.cat([mel, mel, mel])
        
        # Normalize the image; the clamp avoids dividing by zero for an all-silent clip
        max_val = torch.abs(image).max().clamp(min=1e-8)
        image = image / max_val
        
        label = torch.tensor(self.labels[index])
        
        return image, label

In [None]:
# Function to get data according to the folds
def get_data(fold):
    train_df = df[df['fold'] != fold].reset_index(drop=True)
    valid_df = df[df['fold'] == fold].reset_index(drop=True)
    
    train_dataset = BirdClefDataset(train_df, mel_spectrogram, config.sample_rate, config.duration)
    valid_dataset = BirdClefDataset(valid_df, mel_spectrogram, config.sample_rate, config.duration)
    
    train_loader = DataLoader(train_dataset, batch_size=config.train_batch_size, shuffle=True)
    valid_loader = DataLoader(valid_dataset, batch_size=config.valid_batch_size, shuffle=False)
    
    return train_loader, valid_loader

<a id='5'></a>
# Model 🤖

### I will first start with a custom CNN model. After that, we will see how to use pretrained and other advanced models.

In [None]:
class BirdClefModel(nn.Module):
    def __init__(self):
        super(BirdClefModel, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1, padding=1)
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=1, padding=1)
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(128*8*39, 64)
        self.fc2 = nn.Linear(64, config.num_classes)
        #self.softmax = nn.Softmax(dim = None)
        
    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool1(x)
        x = F.relu(self.conv2(x))
        x = self.pool2(x)
        x = F.relu(self.conv3(x))
        x = self.pool3(x)
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        #x = self.softmax(x)
        
        return x
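
The `128*8*39` input size of `fc1` follows from the spectrogram shape. The sketch below reproduces that arithmetic under two assumptions: that the mel transform uses torchaudio's default centered framing (so the frame count is `1 + num_samples // hop_length`), and that the 3x3 convolutions with padding 1 leave height and width unchanged. The helper names are made up for illustration.

```python
def mel_frames(num_samples, hop_length):
    """Number of time frames for a centered STFT."""
    return 1 + num_samples // hop_length

def after_pools(size, num_pools):
    """Each 2x2 max-pool with stride 2 halves a dimension, rounding down."""
    for _ in range(num_pools):
        size //= 2
    return size

# Values taken from the config class in this notebook
sample_rate, duration, hop_length, n_mels = 32_000, 5, 512, 64

frames = mel_frames(sample_rate * duration, hop_length)
height = after_pools(n_mels, 3)   # mel bins after three pooling layers
width = after_pools(frames, 3)    # time frames after three pooling layers
print(frames, height, width, 128 * height * width)
```

If you change `duration`, `hop_length` or `n_mels` in the config, the first argument of `fc1` has to be recomputed in the same way.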

### Now we will fine-tune a pretrained model. Here I have used ResNet50, with the backbone frozen so that only the new classification head is trained. You can use any pretrained model and experiment.

In [None]:
class BirdCLEFResnet(nn.Module):
    def __init__(self):
        super(BirdCLEFResnet, self).__init__()
        self.base_model = getattr(models, config.model_name)(pretrained=True)
        for param in self.base_model.parameters():
            param.requires_grad = False
            
        in_features = self.base_model.fc.in_features
        
        self.base_model.fc = nn.Sequential(
            nn.Linear(in_features, 1024), 
            nn.ReLU(), 
            nn.Dropout(p=0.2),
            nn.Linear(1024, 512), 
            nn.ReLU(), 
            nn.Dropout(p=0.2),
            nn.Linear(512, config.num_classes))
        
    def forward(self, x):
        x = self.base_model(x)
        return x

<a id='6'></a>
# Utility Functions 📋

### Next we define some functions to train the model. These are the basic functions we use to train any PyTorch-based model.

In [None]:
def loss_fn(outputs, labels):
    return nn.CrossEntropyLoss(weight = class_weights)(outputs, labels)

def train(model, data_loader, optimizer, scheduler, device, epoch):
    model.train()
    
    running_loss = 0
    loop = tqdm(data_loader, position=0)
    for i, (mels, labels) in enumerate(loop):
        mels = mels.to(device)
        labels = labels.to(device)
        
        outputs = model(mels)
        _, preds = torch.max(outputs, 1)
        
        loss = loss_fn(outputs, labels)
        
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        
        if scheduler is not None:
            scheduler.step()
            
        running_loss += loss.item()
        
        loop.set_description(f"Epoch [{epoch+1}/{config.epochs}]")
        loop.set_postfix(loss=loss.item())

    return running_loss/len(data_loader)

In [None]:
def valid(model, data_loader, device, epoch):
    model.eval()
    
    running_loss = 0
    pred = []
    label = []
    
    loop = tqdm(data_loader, position=0)
    # Gradients are not needed during validation, so skip tracking them
    with torch.no_grad():
        for mels, labels in loop:
            mels = mels.to(device)
            labels = labels.to(device)
            
            outputs = model(mels)
            _, preds = torch.max(outputs, 1)
            
            loss = loss_fn(outputs, labels)
            
            running_loss += loss.item()
            
            pred.extend(preds.view(-1).cpu().numpy())
            label.extend(labels.view(-1).cpu().numpy())
            
            loop.set_description(f"Epoch [{epoch+1}/{config.epochs}]")
            loop.set_postfix(loss=loss.item())
        
    valid_f1 = f1_score(label, pred, average='macro')
    
    return running_loss/len(data_loader), valid_f1

In [None]:
# A checkpoint is a dict that stores the model weights under 'state_dict'
# and the optimizer state under 'optimizer'.

def load_checkpoint(filepath):
    # Restores the weights and optimizer state into the global `model` and `optimizer`
    checkpoint = torch.load(filepath, map_location=config.device)
    model.load_state_dict(checkpoint['state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer'])

    model.train()
    return model



In [None]:
model = BirdClefModel().to(config.device) # check version 3 for this
#model = BirdCLEFResnet().to(config.device)

optimizer = Adam(model.parameters(), lr=config.learning_rate)

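# Save an initial checkpoint so that fold 0 has something to load.
# Only state dicts are stored; pickling the whole model object is not needed to restore it.
checkpoint = {'state_dict': model.state_dict(), 'optimizer': optimizer.state_dict(), 'loss': 0.5}
torch.save(checkpoint, './model_0.bin')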

def run(fold):
    train_loader, valid_loader = get_data(fold)
    
    # Resume from the checkpoint saved by the previous fold (model_0.bin for the first fold)
    model = load_checkpoint(f'./model_{fold}.bin')
    
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, eta_min=1e-5, T_max=10)
    
    best_valid_f1 = 0
    for epoch in range(config.epochs):
        train_loss = train(model, train_loader, optimizer, scheduler, config.device, epoch)
        valid_loss, valid_f1 = valid(model, valid_loader, config.device, epoch)
        if valid_f1 > best_valid_f1:
            print(f"Validation F1 Improved - {best_valid_f1} ---> {valid_f1}")
            checkpoint = {'state_dict': model.state_dict(), 'optimizer': optimizer.state_dict(), 'loss': valid_loss}
            torch.save(checkpoint, f'./model_{fold + 1}.bin')
            print(f"Saved model checkpoint at ./model_{fold + 1}.bin")
            best_valid_f1 = valid_f1
            
    return best_valid_f1
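
For intuition about the scheduler used in `run`, here is a small sketch of the cosine annealing formula from the PyTorch documentation, `eta_min + 0.5 * (base_lr - eta_min) * (1 + cos(pi * step / T_max))`, evaluated for the values in this notebook. The helper name `cosine_lr` is made up, and the closed form is only used here for steps 0 to `T_max`.

```python
import math

def cosine_lr(step, base_lr, eta_min, t_max):
    """Closed-form cosine annealing for 0 <= step <= t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max))

# learning_rate=1e-3, eta_min=1e-5 and T_max=10, as in this notebook
for step in range(0, 11, 2):
    print(step, f"{cosine_lr(step, 1e-3, 1e-5, 10):.6f}")
```

Because `train` calls `scheduler.step()` once per batch, `T_max=10` means the learning rate falls from `learning_rate` to `eta_min` within the first 10 batches. As far as I understand the scheduler, it then rises again instead of staying low, so a larger `T_max` (for example the total number of training steps) may be worth trying.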

<a id='7'></a>
# Training ⚙️

In [None]:
for fold in range(config.num_fold):
    print("=" * 30)
    print("Training Fold - ", fold)
    print("=" * 30)
    best_valid_f1 = run(fold)
    print(f'Best F1 Score: {best_valid_f1:.5f}')
    
    gc.collect()
    torch.cuda.empty_cache()    
    

# Testing

In [None]:
import json

TEST_AUDIO_PATH = '../input/birdclef-2022/test_soundscapes/'

with open('../input/birdclef-2022/scored_birds.json') as fp:
    SCORED_BIRDS = json.load(fp)

In [None]:
import math

def create_df_test_from_path():
    """Build one submission row per (file, bird, chunk) and one test row per (file, chunk)."""
    files = sorted(os.listdir(TEST_AUDIO_PATH))
    data = []
    submission = []
    for f in files:
        wv, sr = torchaudio.load(TEST_AUDIO_PATH + f)
        # Number of 5-second chunks needed to cover the whole recording
        n_chunks = math.ceil(wv.shape[1] / sr / 5)
        row_prefix = f[:-4]  # file name without its 4-character extension
        ending_seconds = [chunk * 5 for chunk in range(1, n_chunks + 1)]
        
        for bird in SCORED_BIRDS:
            for ending_second in ending_seconds:
                submission.append((f, row_prefix, ending_second, [bird]))
        
        for ending_second in ending_seconds:
            data.append((f, row_prefix, ending_second))
    
    submission_df = pd.DataFrame(submission, columns=['filename', 'row_prefix', 'ending_second', 'birds'])
    test_df = pd.DataFrame(data, columns=['filename', 'row_prefix', 'ending_second'])
    return submission_df, test_df

submission_df, test_df = create_df_test_from_path()

In [None]:
class TestDataset(Dataset):
    def __init__(self, df, transformation, target_sample_rate, duration):
        self.audio_paths = df['filename'].values
        #self.labels = df['birds'].values
        self.transformation = transformation
        self.target_sample_rate = target_sample_rate
        self.num_samples = target_sample_rate*duration
        self.end_sample = df['ending_second'].values * target_sample_rate
        
    def __len__(self):
        return len(self.audio_paths)
    
    def __getitem__(self, index):
        audio_path = f'../input/birdclef-2022/test_soundscapes/{self.audio_paths[index]}'
        signal, sr = torchaudio.load(audio_path) # loaded the audio
        
        # First we check whether the sample rate equals the target sample rate; if it does not, we resample
        if sr != self.target_sample_rate:
            resampler = torchaudio.transforms.Resample(sr, self.target_sample_rate)
            signal = resampler(signal)
        
        # Next we check the number of channels of the signal
        #signal -> (num_channels, num_samples) - Eg.-(2, 14000) -> (1, 14000)
        if signal.shape[0]>1:
            signal = torch.mean(signal, axis=0, keepdim=True)
        
        # Separate the 5-second chunk we want from the signal
        signal = signal[:, (self.end_sample[index]-self.num_samples):self.end_sample[index]]
        
        # Lastly we check the number of samples of the signal
        #signal -> (num_channels, num_samples) - Eg.-(1, 14000) -> (1, self.num_samples)
        # If it is more than the required number of samples, we truncate the signal
        if signal.shape[1] > self.num_samples:
            signal = signal[:, :self.num_samples]
        
        # If it is less than the required number of samples, we pad the signal
        if signal.shape[1]<self.num_samples:
            num_missing_samples = self.num_samples - signal.shape[1]
            last_dim_padding = (0, num_missing_samples)
            signal = F.pad(signal, last_dim_padding)
        
        # All preprocessing is done, so now we extract the mel spectrogram from the signal
        mel = self.transformation(signal)
        
        # Pretrained models expect a 3-channel image, so we stack the mel spectrogram three times
        image = torch.cat([mel, mel, mel])
        
        # Normalize the image; the clamp avoids dividing by zero for an all-silent clip
        max_val = torch.abs(image).max().clamp(min=1e-8)
        image = image / max_val
        
        #label = torch.tensor(self.labels[index])
        
        return image

In [None]:
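def test(model, data_loader, device):
    model.eval()
    
    all_preds = []
    all_outputs = []
    
    loop = tqdm(data_loader, position=0)
    # Gradients are not needed for inference
    with torch.no_grad():
        for mels in loop:
            mels = mels.to(device)
            
            outputs = model(mels)
            _, preds = torch.max(outputs, 1)
            
            # Collect every batch so the result covers the whole test set
            all_preds.append(preds.cpu())
            all_outputs.append(outputs.cpu())
    
    return torch.cat(all_preds), torch.cat(all_outputs)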

In [None]:
# Load Test data
test_dataset = TestDataset(test_df, mel_spectrogram, config.sample_rate, config.duration)

test_loader = DataLoader(test_dataset, batch_size=config.train_batch_size, shuffle=False)

In [None]:
submission_df

In [None]:
test_df

In [None]:
def load_test_checkpoint(filepath):
    # Only the weights are needed for inference; they are loaded into the global `model`
    checkpoint = torch.load(filepath, map_location=config.device)
    model.load_state_dict(checkpoint['state_dict'])

    model.eval()
    return model

In [None]:
#model = BirdCLEFResnet().to(config.device)
model = load_test_checkpoint( f'./model_{config.num_fold}.bin')
preds, outputs = test(model, test_loader, config.device)

outputs = nn.Softmax(dim=1)(outputs)  # class probabilities for each chunk

In [None]:
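# Map each (row_prefix, ending_second) chunk to its row in test_df, which is the order of preds
chunk_index = {key: i for i, key in enumerate(zip(test_df['row_prefix'], test_df['ending_second']))}

predictions = []
for idx in range(len(submission_df)):
    row_prefix = submission_df.iloc[idx, 1]
    ending_second = submission_df.iloc[idx, 2]
    bird = submission_df.iloc[idx, 3][0]
    
    chunk = chunk_index[(row_prefix, ending_second)]
    p = int(preds[chunk])
    
    row_id = f"{row_prefix}_{bird}_{ending_second}"
    # Softmax outputs are always positive, so a threshold of 0.0 never rejects a prediction
    is_present = bool(bird_labels[p] == bird and outputs[chunk][p] > 0.0)
    predictions.append([row_id, is_present])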

In [None]:
predictions_df = pd.DataFrame(predictions,columns=['row_id', 'target'])

In [None]:
predictions_df.to_csv('submission.csv', index=False)

In [None]:
predictions_df