# BirdCLEF 2022

### Description:

In this competition, you’ll use your machine learning skills to identify bird species by sound. Specifically, you'll develop a model that can process continuous audio data and then acoustically recognize the species. The best entries will be able to train reliable classifiers with limited training data.

![image info](https://telecoms.com/wp-content/blogs.dir/1/files/2018/03/Angry-Birds-770x285.jpg)

In [None]:
# Imports for data handling, plotting and audio analysis
from __future__ import print_function
import librosa
import librosa.display
from scipy.fft import fft, fftfreq
from scipy.signal.windows import blackman

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import mimetypes
mimetypes.init()
mimetypes.add_type('audio/ogg', '.ogg')
import IPython.display as ipd
import plotly.express as px
import pandas as pd
import numpy as np
import sklearn
import ast
import json

sns.set_palette("Reds")


### Table of Contents

* [Load data](#section_1)
* [Analysing Metadata](#section_2)
    * [Primary label](#section_2_1)
    * [Secondary label](#section_2_2)
    * [Type](#section_2_3)
    * [Coordinates](#section_2_4)
    * [Rating](#section_2_5)
    * [Time](#section_2_6)
* [Bird Taxonomy](#section_3)

* [Analysing audio files](#section_4)
    * [Spectrogram](#section_4_1)
    * [MFCC](#section_4_2)




### Load data <a class="anchor" id="section_1"></a>



In [None]:
FILE_PATH = "../input/birdclef-2022/"
train_metadata_df = pd.read_csv(FILE_PATH + "train_metadata.csv")
test_df = pd.read_csv(FILE_PATH + "test.csv")
sample_submission_df = pd.read_csv(FILE_PATH + "sample_submission.csv")
bird_taxo_df = pd.read_csv(FILE_PATH + "eBird_Taxonomy_v2021.csv")
with open(FILE_PATH + "scored_birds.json") as f:
    scored_birds = json.load(f)


### Analysing Metadata <a class="anchor" id="section_2"></a>

train_metadata.csv - A wide range of metadata is provided for the training data. The most directly relevant fields are:

- primary_label - a code for the bird species. You can review detailed information about the bird codes by appending the code to https://ebird.org/species/, such as https://ebird.org/species/amecro for the American Crow.
- secondary_labels: Background species as annotated by the recordist. An empty list does not mean that no background birds are audible.
- author - the eBird user who provided the recording.
- filename: the associated audio file.
- rating: Float value between 0.0 and 5.0 as an indicator of the quality rating on Xeno-canto and the number of background species, where 5.0 is the highest and 1.0 is the lowest. 0.0 means that this recording has no user rating yet.


In [None]:
train_metadata_df.head()

In [None]:
train_metadata_df.shape

In [None]:
train_metadata_df.dtypes


#### Primary label <a class="anchor" id="section_2_1"></a>


First of all, let's check how many bird species are available in the dataset. As we can see, there are 152 different species.


In [None]:
train_metadata_df.nunique()

The graph below shows that the most frequent species are `brnowl`, `skylar`, `norcar`, `mallar3` and so on. At the other end are the species with the fewest recordings, such as `hawhaw`, `puaioh`, `bkwpet` and `maupar`, which have only a few audio records each. Training a classification model with so little data is a challenge. One possible way to deal with it is to look at the length of each recording and split the long ones, so that those species get more training samples. We'll see.
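The splitting idea can be sketched without touching the audio itself, by computing where each chunk would start and end. This is only a minimal sketch: `chunk_bounds` is a hypothetical helper, and the 5-second chunk length and 32 kHz sampling rate are assumptions chosen for illustration.

```python
def chunk_bounds(n_samples, sample_rate, chunk_seconds=5.0, keep_tail=False):
    """Return (start, end) sample indices of consecutive fixed-length chunks."""
    chunk_len = int(round(chunk_seconds * sample_rate))
    if chunk_len <= 0:
        raise ValueError("chunk_seconds * sample_rate must be at least one sample")
    # Full-length chunks only: stop before a chunk would run past the end.
    bounds = [(start, start + chunk_len)
              for start in range(0, n_samples - chunk_len + 1, chunk_len)]
    # Optionally keep the shorter remainder at the end of the recording.
    tail_start = len(bounds) * chunk_len
    if keep_tail and tail_start < n_samples:
        bounds.append((tail_start, n_samples))
    return bounds


# A hypothetical 12-second recording sampled at 32 kHz.
bounds = chunk_bounds(n_samples=12 * 32000, sample_rate=32000, chunk_seconds=5.0)
print(bounds)
```

Each pair can then be used to slice a loaded signal, for example `signal[start:end]`. Whether the short tail is kept, padded or dropped is a modeling choice.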

In [None]:
sns.set_palette("Reds")

count_birds_df = train_metadata_df.groupby('primary_label').agg(count_birds=('filename','count')).reset_index()\
    .sort_values('count_birds', ascending=False)

top_20_birds = count_birds_df.iloc[:20,:]
bottom_20_birds = count_birds_df.sort_values('count_birds', ascending=True).iloc[:20,:]

fig, ax =plt.subplots(1,2, figsize=(24,8))
sns.barplot(x="count_birds", y="primary_label", orient='h', data=top_20_birds,ax=ax[0])\
        .set_title("Top 20 - Number bird species")

sns.barplot(x="count_birds", y="primary_label", orient='h', data=bottom_20_birds,ax=ax[1])\
        .set_title("Bottom 20 - Number bird species")

fig.show()

Among the scored birds, only `skylar` and `houfin` appear in the top 20 by recording count. In contrast, `ercfra`, `hawhaw`, `hawpet1`, `maupar`, `crehon` and `puaioh` are in the bottom 20.

In [None]:
print(set(top_20_birds.primary_label).intersection(scored_birds))
print(set(bottom_20_birds.primary_label).intersection(scored_birds))

Let's see how many recordings we have for each scored bird. Most of them have fewer than 100 recordings. Worse, some species have as few as one recording.

In [None]:
sns.set_theme()
scored_birds_count_df = count_birds_df[count_birds_df['primary_label'].isin(scored_birds)].sort_values('count_birds', ascending=False)
plt.figure(figsize=(10,12))
sns.barplot(x="count_birds", y="primary_label", orient='h', data=scored_birds_count_df).set_title("Number of bird species scored")

plt.show()

#### Secondary label <a class="anchor" id="section_2_2"></a>

As mentioned in the `train_metadata` description at the beginning, the recordings contain ambient noise and other bird species, which makes the classification problem more difficult. These secondary species are annotated in some recordings. Let's analyze those cases so that we can handle them properly when training the model.
In total, 1478 recordings have this issue, which is nearly ten percent (about 9.95%) of all recordings.

In [None]:
# secondary_labels is stored as a string such as "['houfin']", so parse it before counting
train_metadata_df['count_secondary_species'] = train_metadata_df['secondary_labels'].apply(lambda x: len(ast.literal_eval(x)))
print("Number of recordings with secondary species in the audio: ", (train_metadata_df['count_secondary_species'] > 0).sum())
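The percentage quoted above is simply the share of rows whose parsed `secondary_labels` list is non-empty. A minimal sketch on made-up values follows; `share_with_secondary` is a hypothetical helper, and the toy strings only imitate the column's format.

```python
import ast


def share_with_secondary(secondary_label_strings):
    """Return (count, percentage) of recordings listing at least one secondary species."""
    parsed = [ast.literal_eval(s) for s in secondary_label_strings]
    n_with = sum(1 for labels in parsed if len(labels) > 0)
    pct = 100.0 * n_with / len(parsed) if parsed else 0.0
    return n_with, pct


# Made-up values in the same string format as the secondary_labels column.
toy = ["[]", "['houfin']", "[]", "['skylar', 'omao']"]
count, pct = share_with_secondary(toy)
print(f"{count} of {len(toy)} recordings ({pct:.2f}%) list secondary species")
```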

`normoc`, `norcar`, `hawcre` and `houspa` are among the species with the most secondary-species annotations across their recordings. Other species such as `hawcre`, `crehon`, `akiapo` and `omao` have the highest mean number of secondary species per recording, ranging from 2 to 3.5.


In [None]:
sns.set_theme()
count_secespecies_df = train_metadata_df.groupby('primary_label').agg(count_secondary_species=('count_secondary_species','sum')).reset_index()\
    .sort_values('count_secondary_species', ascending=False)

men_secespecies_df = train_metadata_df.groupby('primary_label').agg(mean_secondary_species=('count_secondary_species','mean')).reset_index()\
    .sort_values('mean_secondary_species', ascending=False)

max_secespecies_df = train_metadata_df.groupby('primary_label').agg(max_secondary_species=('count_secondary_species','max')).reset_index()\
    .sort_values('max_secondary_species', ascending=False)

fig, ax =plt.subplots(1,3, figsize=(20,18))
sns.barplot(y="primary_label", x="count_secondary_species", orient='h',   data=count_secespecies_df[count_secespecies_df['count_secondary_species']>0],ax=ax[0])\
    .set_title("Num. secondary species's noise by primary bird species")

sns.barplot(y="primary_label", x="mean_secondary_species", orient='h',   data=men_secespecies_df[men_secespecies_df['mean_secondary_species']>0], ax=ax[1])\
    .set_title("Mean secondary species's noise by primary bird species")

sns.barplot(y="primary_label", x="max_secondary_species", orient='h',   data=max_secespecies_df[max_secespecies_df['max_secondary_species']>0], ax=ax[2])\
    .set_title("Max number secondary species's noise by primary bird species")

fig.show()

#### Type <a class="anchor" id="section_2_3"></a>


There are many types of bird vocalization registered (634), but only a few are frequent, such as `call`, `song`, `flight call`, `male` and so on. This information can be subjective, because it relies on the knowledge and experience of each author. It also seems to require some normalization first, because a label like `call` also shows up as `calls`, `Call`, `begging call` and so on. <br />  However, it may be a good idea to group the recordings by category when processing the audio, or to develop a model for each one (just thinking out loud).
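A first pass at the normalization could look like the sketch below. The rules (lower-casing, trimming whitespace, dropping a trailing plural "s") are assumptions for illustration and not a complete cleaning scheme, and `normalize_type` is a hypothetical helper.

```python
import re
from collections import Counter


def normalize_type(label):
    """Lower-case, trim and collapse whitespace, and strip a trailing plural 's'."""
    cleaned = re.sub(r"\s+", " ", label.strip().lower())
    # Naive singularization: 'calls' -> 'call', 'songs' -> 'song'.
    if len(cleaned) > 3 and cleaned.endswith("s") and not cleaned.endswith("ss"):
        cleaned = cleaned[:-1]
    return cleaned


# Made-up raw labels showing the kinds of variants mentioned above.
raw_types = ["call", "Call", "calls", " song ", "Songs", "begging call"]
counts = Counter(normalize_type(t) for t in raw_types)
print(counts.most_common())
```

Note that `begging call` stays a separate category here. Whether it should be folded into `call` is a judgment call that depends on how the types are used later.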

In [None]:
type_df = (train_metadata_df['type'].apply(lambda x: pd.Series(ast.literal_eval(x)))
            .stack()
            .reset_index(level=1, drop=True)
            .to_frame('type')
            .join(train_metadata_df[['primary_label']], how='left')
            )


sns.set_theme()
count_type_sing_df = type_df.groupby('type').agg(count_type_sing=('primary_label','count')).reset_index()\
    .sort_values('count_type_sing', ascending=False)

top_20_type_sing = count_type_sing_df.iloc[:20,:]

plt.figure(figsize=(8,6))

sns.barplot(x="count_type_sing", y="type", orient='h', data=top_20_type_sing)\
        .set_title(f"Top 20 Type of bird songs -  Total type of bird songs: {type_df.type.nunique()}")



plt.show()

In [None]:
type_df = (train_metadata_df[train_metadata_df['primary_label'].isin(scored_birds)]['type'].apply(lambda x: pd.Series(ast.literal_eval(x)))
            .stack()
            .reset_index(level=1, drop=True)
            .to_frame('type')
            .join(train_metadata_df[train_metadata_df['primary_label'].isin(scored_birds)][['primary_label']], how='left')
            )


sns.set_theme()
count_type_sing_df = type_df.groupby('type').agg(count_type_sing=('primary_label','count')).reset_index()\
    .sort_values('count_type_sing', ascending=False)

top_20_type_sing = count_type_sing_df.iloc[:20,:]

plt.figure(figsize=(8,6))

sns.barplot(x="count_type_sing", y="type", orient='h', data=top_20_type_sing)\
        .set_title(f"Top 20 Type of bird songs (scored birds) -  Total type of bird songs: {type_df.type.nunique()}")



#### Coordinates <a class="anchor" id="section_2_4"></a>

There is not much to do here, so let's plot every point on the map to get an idea of where most of the recordings and bird species come from.

In [None]:
train_metadata_df['bird_scored'] = np.where(train_metadata_df['primary_label'].isin(scored_birds), 'scored','noscored')

fig = px.scatter_mapbox(train_metadata_df, lat="latitude", lon="longitude", hover_name="primary_label", hover_data=["primary_label", "scientific_name", "common_name", "author"],
                        color="bird_scored", zoom=3, height=300)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()


#### Rating <a class="anchor" id="section_2_5"></a>

As mentioned in the description, this indicator represents the quality of the audio (ranging from 0 to 5). <br>
The distribution plot below shows that most recordings fall in the bins from 3 to 5. The boxplot shows the rating distribution for each of the top 20 birds, where we can see that the distributions differ substantially between species. <br>

In [None]:
train_metadata_df.rating.describe()
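Since a rating of 0.0 means that a recording has no user rating yet, summary statistics can be misleading if those rows are counted as "lowest quality". The sketch below, on made-up ratings, separates unrated recordings before summarizing. `summarize_rated` is a hypothetical helper.

```python
from statistics import mean, median


def summarize_rated(ratings):
    """Summarize ratings while ignoring 0.0, which marks 'no user rating yet'."""
    rated = [r for r in ratings if r > 0.0]
    return {
        "n_total": len(ratings),
        "n_unrated": len(ratings) - len(rated),
        "mean_rated": mean(rated) if rated else None,
        "median_rated": median(rated) if rated else None,
    }


# Made-up ratings; two of them are unrated (0.0).
toy_ratings = [0.0, 3.0, 4.5, 5.0, 0.0, 3.5]
summary = summarize_rated(toy_ratings)
print(summary)
```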

In [None]:
sns.set(style="ticks")


f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, figsize=(10,10),
                                    gridspec_kw={"height_ratios": (.15, .85)})

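# pass the data as x= so the boxplot is horizontal and shares the x axis with the histogram
sns.boxplot(x=train_metadata_df['rating'], ax=ax_box)
# histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(x=train_metadata_df['rating'], kde=True, stat="density", ax=ax_hist)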

ax_box.set(yticks=[])
sns.despine(ax=ax_hist)
sns.despine(ax=ax_box, left=True)

In [None]:
plt.figure(figsize=(16,8))

sns.boxplot(x="primary_label", y="rating",   data=train_metadata_df[train_metadata_df['primary_label'].isin(top_20_birds.primary_label.tolist())])


Let's see how it looks for the scored birds. The rating distribution also looks very different from one species to another in this set.

In [None]:
plt.figure(figsize=(16,8))

sns.boxplot(x="primary_label", y="rating",   data=train_metadata_df[train_metadata_df['primary_label'].isin(scored_birds)]).set_title("Boxplot rating by scored birds")


#### Time <a class="anchor" id="section_2_6"></a>

This variable represents the length of each audio file. It is stored as an object (string) type, so first we have to convert it into a numeric time value in order to analyze it. The distribution plot shows that it is roughly normally distributed, as expected.


In [None]:
train_metadata_df.time.describe()

In [None]:
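# Keep only the first "NN:NN" pattern; values that do not match become NaN
train_metadata_df['time'] = train_metadata_df['time'].str.extract(r'(\d{2}:\d{2})', expand=False)

def time2second(time):
    """Convert an 'MM:SS' string to seconds; return None for missing values."""
    if pd.isnull(time):
        return None
    minutes, seconds = time.split(':')
    return float(minutes) * 60 + float(seconds)

train_metadata_df['time'] = train_metadata_df['time'].apply(time2second)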

#record_seconds = train_metadata_df[train_metadata_df.time.isnull()!=True].loc[:,'time', 'primary_label'].apply(lambda x: time2second(x[0]) )
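The conversion in the cell above can be checked in isolation on a few made-up strings. This sketch follows the notebook's reading of the field as two colon-separated two-digit numbers interpreted as minutes and seconds, which is an assumption about the data. `to_seconds` is a hypothetical stand-in for the extract-then-convert steps.

```python
import re

# Two two-digit fields separated by a colon, found anywhere in the string.
_MM_SS = re.compile(r"(\d{2}):(\d{2})")


def to_seconds(value):
    """Convert the first 'NN:NN' pattern in value to seconds; return None if absent."""
    if not isinstance(value, str):
        return None
    match = _MM_SS.search(value)
    if match is None:
        return None
    first, second = (int(g) for g in match.groups())
    return first * 60.0 + second


# Made-up inputs, including unparseable and missing values.
samples = ["02:30", "00:45", "?", None, "about 10:05"]
converted = [to_seconds(s) for s in samples]
print(converted)
```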

In [None]:
train_metadata_df['time'].describe()

In [None]:
sns.set(style="ticks")


f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, figsize=(10,10),
                                    gridspec_kw={"height_ratios": (.15, .85)})

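# pass the data as x= so the boxplot is horizontal and shares the x axis with the histogram
sns.boxplot(x=train_metadata_df['time'], ax=ax_box)
# histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(x=train_metadata_df['time'], kde=True, stat="density", ax=ax_hist)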

ax_box.set(yticks=[])
sns.despine(ax=ax_hist)
sns.despine(ax=ax_box, left=True)

The next boxplot shows that audio length is distributed differently for each species.

In [None]:
plt.figure(figsize=(16,8))

sns.boxplot(x="primary_label", y="time",   data=train_metadata_df[train_metadata_df['primary_label'].isin(top_20_birds.primary_label.tolist())]).set_title("Boxplot time by top 20 bird species")


The time distribution looks quite different for the scored bird set.

In [None]:
plt.figure(figsize=(16,8))

sns.boxplot(x="primary_label", y="time",   data=train_metadata_df[train_metadata_df['primary_label'].isin(scored_birds)]).set_title("Boxplot time by scored bird species")


In [None]:
train_metadata_df.iloc[0]

### Bird Taxonomy <a class="anchor" id="section_3"></a>

`bird_taxo_df` provides a taxonomic categorization of each bird. Let's see which family each species in the scored group belongs to.

In [None]:
bird_taxo_df.head()

In [None]:
bird_taxo_df.shape

In [None]:

taxo_df = train_metadata_df[train_metadata_df['primary_label'].isin(scored_birds)].merge( bird_taxo_df, left_on='primary_label', right_on='SPECIES_CODE', how='inner')
fig = px.parallel_categories(taxo_df[['primary_label', 'SPECIES_GROUP', 'ORDER1','FAMILY']])

fig.show()

### Analysing audio files  <a class="anchor" id="section_4"></a>

Here we are going to analyze one randomly sampled audio file for each species in the scored bird group. For each sample we plot the raw waveform, its Fourier spectrum, an STFT spectrogram and a Mel spectrogram.

In [None]:
def plot_signal_info(signal, Fs, text=''):
    """Print basic information about an audio signal, plot it in several views, and create a player.

    Args:
        signal: Input signal
        Fs: Sampling rate of signal
        text: Text to print
    """
    print('%s Fs = %d, x.shape = %s, x.dtype = %s' % (text, Fs, signal.shape, signal.dtype))
    fig, axs = plt.subplots(2, 2, figsize=(24, 12))

    N = len(signal)


    time = np.arange(N) / Fs  # one time stamp per sample, always length N

    # plot raw signal
    axs[0, 0].plot(time, signal, color='royalblue')
    axs[0, 0].set_xlabel('Time (s)')
    axs[0, 0].set_ylabel('Amplitude')
    axs[0, 0].set_title('Raw signal')


    # plot fourier
    w = blackman(N)
    ywf = fft(signal*w)
    xf = fftfreq(N, 1/Fs)[:N//2]

    axs[1, 0].semilogy(xf[1:N//2], 2.0/N * np.abs(ywf[1:N//2]), '-r')
    axs[1, 0].set_xlabel('Frequency (Hz)')
    axs[1, 0].set_ylabel('Amplitude (log scale)')
    axs[1, 0].set_title('Fourier Transform (FFT)')


    # plot Power spectrogram
    Y = librosa.stft(signal, n_fft=512, win_length=512, hop_length=220)
    Ydb = librosa.amplitude_to_db(np.abs(Y), ref=np.max)
    # pass the same hop_length so the time axis matches the STFT frames
    img = librosa.display.specshow(Ydb, y_axis='log', sr=Fs, hop_length=220, x_axis='time', ax=axs[0, 1])
    fig.colorbar(img, ax=axs[0, 1], format='%+2.0f dB')
    axs[0, 1].set_title('Power spectrogram (STFT)')
    axs[0, 1].set_xlabel('Time (s)')


    # plot Mel spectrogram (mel bands 1 to 13 only)
    S = librosa.feature.melspectrogram(y=signal, sr=Fs, n_mels=128,
                                   fmax=Fs/2)
    img = librosa.display.specshow(librosa.power_to_db(S[1:14,:], ref=np.max), y_axis='mel', sr=Fs, x_axis='time', ax=axs[1, 1])

    fig.colorbar(img, ax=axs[1, 1], format='%+2.0f dB')
    axs[1, 1].set_title('Mel spectrogram')
    axs[1, 1].set_xlabel('Time (s)')


    #axs[0, 0].tight_layout()
    plt.show()
    ipd.display(ipd.Audio(data=signal, rate=Fs))
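To see what the windowed FFT panel in the function above is measuring, the same idea can be tried on a synthetic tone whose frequency is known. This is a minimal sketch using NumPy's `np.blackman`, `np.fft.rfft` and `np.fft.rfftfreq` in place of the SciPy functions used in the notebook. The 8 kHz sampling rate, the 440 Hz tone and the helper `dominant_frequency` are assumptions for illustration.

```python
import numpy as np


def dominant_frequency(signal, sample_rate):
    """Return the frequency (Hz) of the largest non-DC peak in the windowed spectrum."""
    n = len(signal)
    window = np.blackman(n)                        # taper to reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(signal * window))
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    peak_bin = 1 + np.argmax(spectrum[1:])         # skip the DC bin
    return float(freqs[peak_bin])


sample_rate = 8000                                 # hypothetical rate for the toy signal
t = np.arange(sample_rate) / sample_rate           # one second of samples
tone = np.sin(2 * np.pi * 440.0 * t)               # synthetic 440 Hz sine
print(dominant_frequency(tone, sample_rate))
```

On a real recording the spectrum has many peaks, so a single dominant frequency is only a rough summary. The spectrogram panels show how that content changes over time.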

In [None]:
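def load_sound_file(filename):
    """Load a training audio file and return (signal, sampling_rate)."""
    fpath = FILE_PATH + 'train_audio/' + filename
    # librosa.load resamples to 22050 Hz mono by default; pass sr=None to keep the native rate
    return librosa.load(fpath)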

### akiapo

In [None]:
idx = 0 
bird_sample = train_metadata_df[train_metadata_df['primary_label']==scored_birds[idx]].sample(1)
sound_file_path = bird_sample['filename'].values[0]
signal, fs = load_sound_file(sound_file_path)
bird_name = scored_birds[idx]
plot_signal_info(signal=signal, Fs=fs, text=f'{bird_name} ({sound_file_path}) : ')

### aniani

In [None]:
idx = 1
bird_sample = train_metadata_df[train_metadata_df['primary_label']==scored_birds[idx]].sample(1)
sound_file_path = bird_sample['filename'].values[0]
signal, fs = load_sound_file(sound_file_path)
bird_name = scored_birds[idx]
plot_signal_info(signal=signal, Fs=fs, text=f'{bird_name} ({sound_file_path}) : ')

### apapan

In [None]:
idx = 2
bird_sample = train_metadata_df[train_metadata_df['primary_label']==scored_birds[idx]].sample(1)
sound_file_path = bird_sample['filename'].values[0]
signal, fs = load_sound_file(sound_file_path)
bird_name = scored_birds[idx]
plot_signal_info(signal=signal, Fs=fs, text=f'{bird_name} ({sound_file_path}) : ')

### barpet

In [None]:
idx = 3
bird_sample = train_metadata_df[train_metadata_df['primary_label']==scored_birds[idx]].sample(1)
sound_file_path = bird_sample['filename'].values[0]
signal, fs = load_sound_file(sound_file_path)
bird_name = scored_birds[idx]
plot_signal_info(signal=signal, Fs=fs, text=f'{bird_name} ({sound_file_path}) : ')

### crehon

In [None]:
idx = 4
bird_sample = train_metadata_df[train_metadata_df['primary_label']==scored_birds[idx]].sample(1)
sound_file_path = bird_sample['filename'].values[0]
signal, fs = load_sound_file(sound_file_path)
bird_name = scored_birds[idx]
plot_signal_info(signal=signal, Fs=fs, text=f'{bird_name} ({sound_file_path}) : ')

### elepai

In [None]:
idx = 5
bird_sample = train_metadata_df[train_metadata_df['primary_label']==scored_birds[idx]].sample(1)
sound_file_path = bird_sample['filename'].values[0]
signal, fs = load_sound_file(sound_file_path)
bird_name = scored_birds[idx]
plot_signal_info(signal=signal, Fs=fs, text=f'{bird_name} ({sound_file_path}) : ')

### ercfra

In [None]:
idx = 6
bird_sample = train_metadata_df[train_metadata_df['primary_label']==scored_birds[idx]].sample(1)
sound_file_path = bird_sample['filename'].values[0]
signal, fs = load_sound_file(sound_file_path)
bird_name = scored_birds[idx]
plot_signal_info(signal=signal, Fs=fs, text=f'{bird_name} ({sound_file_path}) : ')

### hawama

In [None]:
idx = 7
bird_sample = train_metadata_df[train_metadata_df['primary_label']==scored_birds[idx]].sample(1)
sound_file_path = bird_sample['filename'].values[0]
signal, fs = load_sound_file(sound_file_path)
bird_name = scored_birds[idx]
plot_signal_info(signal=signal, Fs=fs, text=f'{bird_name} ({sound_file_path}) : ')

### hawcre

In [None]:
idx = 8
bird_sample = train_metadata_df[train_metadata_df['primary_label']==scored_birds[idx]].sample(1)
sound_file_path = bird_sample['filename'].values[0]
signal, fs = load_sound_file(sound_file_path)
bird_name = scored_birds[idx]
plot_signal_info(signal=signal, Fs=fs, text=f'{bird_name} ({sound_file_path}) : ')

### hawgoo

In [None]:
idx = 9
bird_sample = train_metadata_df[train_metadata_df['primary_label']==scored_birds[idx]].sample(1)
sound_file_path = bird_sample['filename'].values[0]
signal, fs = load_sound_file(sound_file_path)
bird_name = scored_birds[idx]
plot_signal_info(signal=signal, Fs=fs, text=f'{bird_name} ({sound_file_path}) : ')

### hawhaw

In [None]:
idx = 10
bird_sample = train_metadata_df[train_metadata_df['primary_label']==scored_birds[idx]].sample(1)
sound_file_path = bird_sample['filename'].values[0]
signal, fs = load_sound_file(sound_file_path)
bird_name = scored_birds[idx]
plot_signal_info(signal=signal, Fs=fs, text=f'{bird_name} ({sound_file_path}) : ')

### hawpet1

In [None]:
idx = 11
bird_sample = train_metadata_df[train_metadata_df['primary_label']==scored_birds[idx]].sample(1)
sound_file_path = bird_sample['filename'].values[0]
signal, fs = load_sound_file(sound_file_path)
bird_name = scored_birds[idx]
plot_signal_info(signal=signal, Fs=fs, text=f'{bird_name} ({sound_file_path}) : ')

### houfin

In [None]:
idx = 12
bird_sample = train_metadata_df[train_metadata_df['primary_label']==scored_birds[idx]].sample(1)
sound_file_path = bird_sample['filename'].values[0]
signal, fs = load_sound_file(sound_file_path)
bird_name = scored_birds[idx]
plot_signal_info(signal=signal, Fs=fs, text=f'{bird_name} ({sound_file_path}) : ')

### iiwi

In [None]:
idx = 13
bird_sample = train_metadata_df[train_metadata_df['primary_label']==scored_birds[idx]].sample(1)
sound_file_path = bird_sample['filename'].values[0]
signal, fs = load_sound_file(sound_file_path)
bird_name = scored_birds[idx]
plot_signal_info(signal=signal, Fs=fs, text=f'{bird_name} ({sound_file_path}) : ')

### jabwar

In [None]:
idx = 14
bird_sample = train_metadata_df[train_metadata_df['primary_label']==scored_birds[idx]].sample(1)
sound_file_path = bird_sample['filename'].values[0]
signal, fs = load_sound_file(sound_file_path)
bird_name = scored_birds[idx]
plot_signal_info(signal=signal, Fs=fs, text=f'{bird_name} ({sound_file_path}) : ')

### maupar

In [None]:
idx = 15
bird_sample = train_metadata_df[train_metadata_df['primary_label']==scored_birds[idx]].sample(1)
sound_file_path = bird_sample['filename'].values[0]
signal, fs = load_sound_file(sound_file_path)
bird_name = scored_birds[idx]
plot_signal_info(signal=signal, Fs=fs, text=f'{bird_name} ({sound_file_path}) : ')

### omao

In [None]:
idx = 16
bird_sample = train_metadata_df[train_metadata_df['primary_label']==scored_birds[idx]].sample(1)
sound_file_path = bird_sample['filename'].values[0]
signal, fs = load_sound_file(sound_file_path)
bird_name = scored_birds[idx]
plot_signal_info(signal=signal, Fs=fs, text=f'{bird_name} ({sound_file_path}) : ')

### puaioh

In [None]:
idx = 17
bird_sample = train_metadata_df[train_metadata_df['primary_label']==scored_birds[idx]].sample(1)
sound_file_path = bird_sample['filename'].values[0]
signal, fs = load_sound_file(sound_file_path)
bird_name = scored_birds[idx]
plot_signal_info(signal=signal, Fs=fs, text=f'{bird_name} ({sound_file_path}) : ')

### skylar

In [None]:
idx = 18
bird_sample = train_metadata_df[train_metadata_df['primary_label']==scored_birds[idx]].sample(1)
sound_file_path = bird_sample['filename'].values[0]
signal, fs = load_sound_file(sound_file_path)
bird_name = scored_birds[idx]
plot_signal_info(signal=signal, Fs=fs, text=f'{bird_name} ({sound_file_path}) : ')

### warwhe1

In [None]:
idx = 19
bird_sample = train_metadata_df[train_metadata_df['primary_label']==scored_birds[idx]].sample(1)
sound_file_path = bird_sample['filename'].values[0]
signal, fs = load_sound_file(sound_file_path)
bird_name = scored_birds[idx]
plot_signal_info(signal=signal, Fs=fs, text=f'{bird_name} ({sound_file_path}) : ')