# <center style="font-family: consolas; font-size: 32px; font-weight: bold;">🐦 BirdCLEF 🕊️ - Data and problem investigation</center>
<p><center style="color:#949494; font-family: consolas; font-size: 20px;">BirdCLEF 2023 - Identify bird calls in soundscapes</center></p>

***

# <center style="font-family: consolas; font-size: 32px; font-weight: bold;">(ಠಿ⁠_⁠ಠ) Overview</center>

<p style="font-family: consolas; font-size: 16px;">⚪ The goal of this competition is to use machine learning to <b>identify Eastern African bird species by sound</b>.</p>

<p style="font-family: consolas; font-size: 16px;">⚪ The purpose of this is to <b>provide a more cost-effective and logistically feasible method</b> of conducting bird biodiversity surveys, which can be challenging and expensive when done through traditional observer-based methods.</p>

<p style="font-family: consolas; font-size: 16px;">⚪ By using passive acoustic monitoring (PAM) combined with new analytical tools based on machine learning, conservationists can sample much larger spatial scales with higher temporal resolution, allowing for a more comprehensive exploration of the relationship between restoration interventions and biodiversity.</p>

<p style="font-family: consolas; font-size: 16px;">⚪ The best entries in the competition will be able to develop reliable classifiers with limited training data, which will help advance ongoing efforts to protect avian biodiversity in Africa, including those led by the Kenyan conservation organization NATURAL STATE.</p>

#### <a id="top"></a>
# <div style="box-shadow: rgb(60, 121, 245) 0px 0px 0px 3px inset, rgb(255, 255, 255) 10px -10px 0px -3px, rgb(31, 193, 27) 10px -10px, rgb(255, 255, 255) 20px -20px 0px -3px, rgb(255, 217, 19) 20px -20px, rgb(255, 255, 255) 30px -30px 0px -3px, rgb(255, 156, 85) 30px -30px, rgb(255, 255, 255) 40px -40px 0px -3px, rgb(255, 85, 85) 40px -40px; padding:20px; margin-right: 40px; font-size:30px; font-family: consolas; text-align:center; display:fill; border-radius:15px; color:rgb(60, 121, 245);"><b>Table of contents</b></div>

<div style="background-color: rgba(60, 121, 245, 0.03); padding:30px; font-size:15px; font-family: consolas;">

* [0. Import all dependencies](#0)
* [1. Overview directories](#1)
    * [1.1 Overview train_audio/ directory](#1.1)
    * [1.2 Overview test_soundscapes/ directory](#1.2)
* [2. Overview train_metadata.csv file](#2)
    * [2.1 Check for missing data](#2.1)
    * [2.2 Consider how many classes are present in the training set](#2.2)
    * [2.3 Consider the column secondary labels](#2.3)
    * [2.4 Consider the column type](#2.4)
    * [2.5 Consider the column scientific name](#2.5)
    * [2.6 Consider the columns latitude & longitude](#2.6)
* [3. Overview eBird_Taxonomy_v2021.csv file](#3)
    * [3.1 Check for missing data](#3.1)

<a id="0"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 0. Import all dependencies </b></div>

In [1]:
import os
import random
random.seed(40)  # fix the seed of Python's RNG (used later for the marker colors)
import cv2
import folium
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
from folium.plugins import HeatMap
from folium.features import DivIcon
from IPython.display import Audio, display

In [2]:
class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'

In [3]:
def display_audio(
    dir_path: str, label: str, example: str
) -> None:
    
    if label == "":
        filename = f"{dir_path}/{example}.ogg"
        label = "None"
    else:
        filename = f"{dir_path}/{label}/{example}.ogg"
    
    print(f"\nLabel - {color.BOLD}{color.PURPLE}{label}{color.END}, example - {example}:")
    display(Audio(filename=filename))

<a id="1"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 1. Overview directories </b></div>

<a id="1.1"></a>
## <div style="box-shadow: rgba(0, 0, 0, 0.18) 0px 2px 4px inset; padding:20px; font-size:24px; font-family: consolas; text-align:center; display:fill; border-radius:15px; color:rgb(67, 66, 66)"> <b> 1.1 Overview <i>train_audio/</i> directory</b></div>

<p style="font-family: consolas; font-size: 16px;">⚪ The training data for this competition consists of short recordings of individual bird calls contributed by users of <a href="https://xeno-canto.org/"><strong>xeno-canto.org</strong></a>. To match the test set audio, these files have been converted to the ogg format and downsampled to 32 kHz where applicable. The training data is expected to contain nearly all relevant files, so there is no need to search for additional ones on <a href="https://xeno-canto.org/"><strong>xeno-canto.org</strong></a>:</p>

<p style="text-align:center;"><img src="https://user-images.githubusercontent.com/45982614/223520408-82b31ee8-3733-4ed6-b62d-46a88b9def3b.png" width="90%" height="90%"></p>



<p style="font-family: consolas; font-size: 16px;">⚪ Number of entries in the directory:</p>

In [4]:
len(os.listdir("/kaggle/input/birdclef-2023/train_audio"))

264
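
<p style="font-family: consolas; font-size: 16px;">⚪ The number above counts species folders, not recordings. A minimal sketch for counting the <code>.ogg</code> files inside each folder is shown below. The helper name <code>count_recordings</code> is hypothetical, and the demo runs on a throwaway temporary directory with placeholder file names rather than on the real dataset:</p>

```python
import tempfile
from pathlib import Path


def count_recordings(train_audio_dir):
    """Map each species folder name to the number of .ogg files it contains."""
    root = Path(train_audio_dir)
    return {
        species_dir.name: sum(1 for _ in species_dir.glob("*.ogg"))
        for species_dir in sorted(root.iterdir())
        if species_dir.is_dir()
    }


# Demo on a temporary directory that mimics the train_audio/ layout.
# The file names below are placeholders, not real dataset entries.
with tempfile.TemporaryDirectory() as tmp:
    for species, n_files in [("abethr1", 3), ("abhori1", 1)]:
        folder = Path(tmp) / species
        folder.mkdir()
        for i in range(n_files):
            (folder / f"XC{i}.ogg").touch()
    demo_counts = count_recordings(tmp)

print(demo_counts)
```

<p style="font-family: consolas; font-size: 16px;">⚪ On Kaggle, the same helper could be pointed at <code>/kaggle/input/birdclef-2023/train_audio</code>.</p>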

<p style="font-family: consolas; font-size: 16px;">⚪ Let's listen to a few samples:</p>

In [5]:
display_audio("/kaggle/input/birdclef-2023/train_audio", "abethr1", "XC128013")
display_audio("/kaggle/input/birdclef-2023/train_audio", "abhori1", "XC120250")
display_audio("/kaggle/input/birdclef-2023/train_audio", "abythr1", "XC115981")
display_audio("/kaggle/input/birdclef-2023/train_audio", "afbfly1", "XC200995")
display_audio("/kaggle/input/birdclef-2023/train_audio", "afdfly1", "XC115969")


Label - abethr1, example - XC128013:

Label - abhori1, example - XC120250:

Label - abythr1, example - XC115981:

Label - afbfly1, example - XC200995:

Label - afdfly1, example - XC115969:


<a id="1.2"></a>
## <div style="box-shadow: rgba(0, 0, 0, 0.18) 0px 2px 4px inset; padding:20px; font-size:24px; font-family: consolas; text-align:center; display:fill; border-radius:15px; color:rgb(67, 66, 66)"> <b> 1.2 Overview <i>test_soundscapes/</i> directory</b></div>

<p style="font-family: consolas; font-size: 16px;">⚪ When you submit a notebook, the test_soundscapes directory will be populated with approximately 200 recordings to be used for scoring. These recordings are 10 minutes in duration and are saved in the ogg audio format, with their file names randomized. Your submission notebook should take approximately five minutes to load all of the test soundscapes.</p>

<p style="font-family: consolas; font-size: 16px;">⚪ Until then, the directory contains a single audio file as an example.</p>

In [6]:
!ls /kaggle/input/birdclef-2023/test_soundscapes

soundscape_29201.ogg


In [7]:
!ls -lh /kaggle/input/birdclef-2023/test_soundscapes/soundscape_29201.ogg

-rw-r--r-- 1 nobody nogroup 4.3M Mar  7 18:09 /kaggle/input/birdclef-2023/test_soundscapes/soundscape_29201.ogg
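
<p style="font-family: consolas; font-size: 16px;">⚪ The file size on disk refers to compressed audio. As a rough back-of-the-envelope sketch, the decoded size of one 10-minute soundscape can be estimated as follows. The 32 kHz rate is taken from the training data description and is only assumed to hold for the test audio; mono float32 decoding is likewise an assumption:</p>

```python
SAMPLE_RATE_HZ = 32_000   # from the train_audio description; assumed for test audio too
DURATION_S = 10 * 60      # each test soundscape is 10 minutes long
BYTES_PER_SAMPLE = 4      # assumption: decoded to float32, single channel

samples_per_file = SAMPLE_RATE_HZ * DURATION_S
decoded_mib = samples_per_file * BYTES_PER_SAMPLE / 2**20

print(f"{samples_per_file:,} samples per soundscape")       # 19,200,000
print(f"{decoded_mib:.1f} MiB per soundscape once decoded")  # 73.2
```

<p style="font-family: consolas; font-size: 16px;">⚪ Under these assumptions, loading many soundscapes at once costs far more memory than their size on disk suggests, so processing them one at a time may be the safer default.</p>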


<p style="font-family: consolas; font-size: 16px;">⚪ Let's listen to it:</p>

In [8]:
display_audio("/kaggle/input/birdclef-2023/test_soundscapes", "", "soundscape_29201")


Label - None, example - soundscape_29201:


<a id="2"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 2. Overview <i>train_metadata.csv</i> file</b></div>

<p style="font-family: consolas; font-size: 16px;">A wide range of metadata is provided for the training data. The most directly relevant fields are:</p>

* <p style="font-family: consolas; font-size: 16px;"> <b><i><code>primary_label</code></i></b>: a code for the bird species. You can look up detailed information about a species by appending its code to <a href="https://ebird.org/species/"><strong>https://ebird.org/species/</strong></a>, for example <a href="https://ebird.org/species/amecro"><strong>https://ebird.org/species/amecro</strong></a> for the American Crow.</p>
* <p style="font-family: consolas; font-size: 16px;"> <b><i><code>latitude</code></i></b> & <b><i><code>longitude</code></i></b>: coordinates of the place where the recording was taken. Some bird species may have local call 'dialects', so you may want to seek geographic diversity in your training data.</p>
* <p style="font-family: consolas; font-size: 16px;"> <b><i><code>author</code></i></b>: the user who provided the recording.</p>
* <p style="font-family: consolas; font-size: 16px;"> <b><i><code>filename</code></i></b>: the name of the associated audio file.</p>
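
<p style="font-family: consolas; font-size: 16px;">⚪ The species lookup described for <code>primary_label</code> can be sketched as a small helper. The function name <code>ebird_species_url</code> is hypothetical; only the base URL and the <code>amecro</code> example come from the description above:</p>

```python
EBIRD_SPECIES_URL = "https://ebird.org/species/"


def ebird_species_url(code):
    """Return the eBird species page URL for a species code."""
    return EBIRD_SPECIES_URL + code.strip()


print(ebird_species_url("amecro"))  # https://ebird.org/species/amecro
```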

<p style="font-family: consolas; font-size: 16px;">⚪ Read the .csv file.</p>

In [9]:
train_metadata_df = pd.read_csv("/kaggle/input/birdclef-2023/train_metadata.csv")

In [10]:
train_metadata_df.head()

Unnamed: 0,primary_label,secondary_labels,type,latitude,longitude,scientific_name,common_name,author,license,rating,url,filename
0,abethr1,[],['song'],4.3906,38.2788,Turdus tephronotus,African Bare-eyed Thrush,Rolf A. de By,Creative Commons Attribution-NonCommercial-Sha...,4.0,https://www.xeno-canto.org/128013,abethr1/XC128013.ogg
1,abethr1,[],['call'],-2.9524,38.2921,Turdus tephronotus,African Bare-eyed Thrush,James Bradley,Creative Commons Attribution-NonCommercial-Sha...,3.5,https://www.xeno-canto.org/363501,abethr1/XC363501.ogg
2,abethr1,[],['song'],-2.9524,38.2921,Turdus tephronotus,African Bare-eyed Thrush,James Bradley,Creative Commons Attribution-NonCommercial-Sha...,3.5,https://www.xeno-canto.org/363502,abethr1/XC363502.ogg
3,abethr1,[],['song'],-2.9524,38.2921,Turdus tephronotus,African Bare-eyed Thrush,James Bradley,Creative Commons Attribution-NonCommercial-Sha...,5.0,https://www.xeno-canto.org/363503,abethr1/XC363503.ogg
4,abethr1,[],"['call', 'song']",-2.9524,38.2921,Turdus tephronotus,African Bare-eyed Thrush,James Bradley,Creative Commons Attribution-NonCommercial-Sha...,4.5,https://www.xeno-canto.org/363504,abethr1/XC363504.ogg


In [11]:
print("Examples count:", len(train_metadata_df))

Examples count: 16941


In [12]:
train_metadata_df.describe()

Unnamed: 0,latitude,longitude,rating
count,16714.0,16714.0,16941.0
mean,12.599897,22.03569,3.727732
std,29.208254,28.743382,1.10106
min,-38.1169,-157.8194,0.0
25%,-6.256,5.941125,3.0
50%,2.3595,26.75065,4.0
75%,42.7871,36.58985,4.5
max,71.9769,177.6849,5.0


<a id="2.1"></a>
## <div style="box-shadow: rgba(0, 0, 0, 0.18) 0px 2px 4px inset; padding:20px; font-size:24px; font-family: consolas; text-align:center; display:fill; border-radius:15px; color:rgb(67, 66, 66)"> <b> 2.1 Check for missing data</b></div>

In [13]:
train_metadata_df.isnull().sum()

primary_label         0
secondary_labels      0
type                  0
latitude            227
longitude           227
scientific_name       0
common_name           0
author                0
license               0
rating                0
url                   0
filename              0
dtype: int64
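
<p style="font-family: consolas; font-size: 16px;">⚪ To put the missing coordinates in perspective, here is a small arithmetic sketch that uses only the numbers printed above (16941 rows, 227 missing values per coordinate column):</p>

```python
total_rows = 16941      # "Examples count" printed earlier
missing_coords = 227    # NaN count for latitude (identical for longitude)

rows_with_coords = total_rows - missing_coords
missing_pct = 100 * missing_coords / total_rows

print(rows_with_coords)                                # 16714
print(f"{missing_pct:.2f}% of rows lack coordinates")  # 1.34%
```

<p style="font-family: consolas; font-size: 16px;">⚪ The 16714 remaining rows match the <code>count</code> reported for <code>latitude</code> and <code>longitude</code> in the <code>describe()</code> output.</p>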

<a id="2.2"></a>
## <div style="box-shadow: rgba(0, 0, 0, 0.18) 0px 2px 4px inset; padding:20px; font-size:24px; font-family: consolas; text-align:center; display:fill; border-radius:15px; color:rgb(67, 66, 66)"> <b> 2.2 Consider how many classes are present in the training set</b></div>

In [14]:
primary_label_counts = train_metadata_df.primary_label.value_counts()

In [15]:
print("Primary labels count:", len(primary_label_counts.index))

Primary labels count: 264


<p style="font-family: consolas; font-size: 16px;">⚪ Let's build a bar plot to compare the number of recordings per class. With 264 labels, a static plot cannot show every label legibly, so the plot below is interactive: zoom and hover to examine the full distribution.</p>

In [16]:
primary_label_counts = train_metadata_df.primary_label.value_counts()

fig = px.bar(x=primary_label_counts.index, y=primary_label_counts.values)
fig.update_layout(xaxis_title="Label", yaxis_title="Count")
fig.show()
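
<p style="font-family: consolas; font-size: 16px;">⚪ A numeric summary can complement the bar plot. The sketch below computes the most common class, the rarest class, and the ratio between their counts. It uses made-up toy labels for illustration only; the real values would come from <code>train_metadata_df.primary_label</code>:</p>

```python
from collections import Counter

# Toy labels for illustration only; these counts are NOT the real dataset counts.
toy_labels = ["abethr1"] * 5 + ["abhori1"] * 2 + ["abythr1"] * 1

counts = Counter(toy_labels)
most_common_label, max_count = counts.most_common(1)[0]
rarest_label, min_count = min(counts.items(), key=lambda item: item[1])
imbalance_ratio = max_count / min_count

print(f"most common: {most_common_label} ({max_count})")
print(f"rarest: {rarest_label} ({min_count})")
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```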

<a id="2.3"></a>
## <div style="box-shadow: rgba(0, 0, 0, 0.18) 0px 2px 4px inset; padding:20px; font-size:24px; font-family: consolas; text-align:center; display:fill; border-radius:15px; color:rgb(67, 66, 66)"> <b> 2.3 Consider the column <code><i>secondary labels</i></code></b></div>

<p style="font-family: consolas; font-size: 16px;"> ⚪ Consider how many recordings have at least one secondary label.</p>

In [17]:
print("All secondary column occurrences:", sum(train_metadata_df.secondary_labels != "[]"))

All secondary column occurrences: 2305


<p style="font-family: consolas; font-size: 16px;">⚪ Let's combine all secondary labels into one list and plot their distribution. Since each entry is a list stored as a string, we can convert the string back to a list using the built-in <b>eval</b> function (for untrusted input, <code>ast.literal_eval</code> is the safer choice).</p>

In [18]:
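# Parse each stringified list and flatten the results into a single list of labels
all_secondary_labels = sum([eval(x) for x in train_metadata_df.secondary_labels], [])
# Series.value_counts() replaces the top-level pd.value_counts(), which is deprecated in recent pandas
all_secondary_labels_counts = pd.Series(all_secondary_labels).value_counts()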
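
<p style="font-family: consolas; font-size: 16px;">⚪ As an alternative sketch, the same parsing can be done with <code>ast.literal_eval</code>, which only accepts Python literals and therefore cannot execute arbitrary code. The strings below are toy values in the same format as the <code>secondary_labels</code> column, not real rows:</p>

```python
import ast
from collections import Counter
from itertools import chain

# Toy values in the same string format as the secondary_labels column.
raw_values = ["[]", "['abhori1']", "['abhori1', 'afbfly1']"]

# Parse each string into a real list, then flatten and count the labels.
parsed = [ast.literal_eval(value) for value in raw_values]
flat_labels = list(chain.from_iterable(parsed))
label_counts = Counter(flat_labels)

print(label_counts.most_common())  # [('abhori1', 2), ('afbfly1', 1)]
```

<p style="font-family: consolas; font-size: 16px;">⚪ <code>chain.from_iterable</code> flattens the lists in linear time, whereas <code>sum(lists, [])</code> builds a new list at every step.</p>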

In [19]:
fig = px.bar(x=all_secondary_labels_counts.index, y=all_secondary_labels_counts.values)
fig.update_layout(xaxis_title="Secondary Label", yaxis_title="Count")
fig.show()

<a id="2.4"></a>
## <div style="box-shadow: rgba(0, 0, 0, 0.18) 0px 2px 4px inset; padding:20px; font-size:24px; font-family: consolas; text-align:center; display:fill; border-radius:15px; color:rgb(67, 66, 66)"> <b> 2.4 Consider the column <code><i>type</i></code></b></div>

In [20]:
print("Type column occurrences:", sum(train_metadata_df.type != "[]"))

Type column occurrences: 16941


In [21]:
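# Parse each stringified list and flatten the results into a single list of types
type_labels = sum([eval(x) for x in train_metadata_df.type], [])
# Series.value_counts() replaces the top-level pd.value_counts(), which is deprecated in recent pandas
type_counts = pd.Series(type_labels).value_counts()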

In [22]:
fig = px.bar(x=type_counts.index, y=type_counts.values)
fig.update_layout(xaxis_title="Audio type", yaxis_title="Count")
fig.show()

<a id="2.5"></a>
## <div style="box-shadow: rgba(0, 0, 0, 0.18) 0px 2px 4px inset; padding:20px; font-size:24px; font-family: consolas; text-align:center; display:fill; border-radius:15px; color:rgb(67, 66, 66)"> <b> 2.5 Consider the column <code><i>scientific name</i></code></b></div>

In [23]:
scientific_name_counts = train_metadata_df.scientific_name.value_counts()

fig = px.bar(x=scientific_name_counts.index, y=scientific_name_counts.values)
fig.update_layout(xaxis_title="Scientific name", yaxis_title="Count")
fig.show()

<a id="2.6"></a>
## <div style="box-shadow: rgba(0, 0, 0, 0.18) 0px 2px 4px inset; padding:20px; font-size:24px; font-family: consolas; text-align:center; display:fill; border-radius:15px; color:rgb(67, 66, 66)"> <b> 2.6 Consider the columns <code><i>latitude</i></code> & <code><i>longitude</i></code></b></div>

<p style="font-family: consolas; font-size: 16px;">🔴 Most records include the location where the recording was made (its latitude and longitude). Let's visualize this data on a map using <b>folium</b>.</p>

In [24]:
# As noted in section 2.1, some rows have NaN coordinates; drop them before plotting
filtered_train_metadata_df = train_metadata_df.dropna(subset=["latitude", "longitude"])

# Create a folium map object centered on the mean of the latitude and longitude coordinates
map_center = [
    filtered_train_metadata_df.latitude.mean(), 
    filtered_train_metadata_df.longitude.mean()
]
m = folium.Map(location=map_center, zoom_start=4)

# Create a heatmap layer using the latitude and longitude coordinates
heat_data = filtered_train_metadata_df[['latitude', 'longitude']].values.tolist()
HeatMap(heat_data).add_to(m)

# And visualize
m

<p style="font-family: consolas; font-size: 16px;">🔴 Let's combine latitude and longitude with the class label. Plotting the entire dataframe makes the visualization lag noticeably, so I take a random sample of 750 records instead.</p>

<p style="font-family: consolas; font-size: 16px;">⚪ Each label is assigned a randomly generated color. To see which label a marker belongs to, click its icon and a popup with the label will be shown.</p>

In [25]:
sample_size = 750

# Create a folium map object centered on the mean of the latitude and longitude coordinates
m = folium.Map(location=map_center, zoom_start=4)

# Randomize color for each class label
r = lambda: random.randint(0,255)
color_map = {
    class_label: "#%02X%02X%02X" % (r(),r(),r()) 
    for class_label in filtered_train_metadata_df['primary_label'].unique()
}

# Sample 750 records of the dataframe
# loop through it
# and add a marker for each record
for index, row in filtered_train_metadata_df.sample(n=sample_size).iterrows():
    marker_color = color_map[row["primary_label"]]
    folium.Marker(
        location=[row["latitude"], row["longitude"]], 
        popup=row["primary_label"],
        icon=DivIcon(
            icon_size=(150,36),
            icon_anchor=(7,20),
            html=f'<div class="fa fa-dove" style="font-size: 18pt; color: {marker_color}"></div>',
        )
    ).add_to(m)
    
m
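
<p style="font-family: consolas; font-size: 16px;">⚪ Random colors change whenever the seed or the cell execution order changes. As a sketch of one alternative, a color can be derived deterministically from the label itself by hashing it. The helper name <code>label_to_hex_color</code> is hypothetical, and hashing is a swapped-in technique, not what the cell above does. Like random colors, hashed colors are not guaranteed to be unique across labels:</p>

```python
import hashlib


def label_to_hex_color(label):
    """Derive a stable '#RRGGBB' color from a label via its MD5 digest."""
    digest = hashlib.md5(label.encode("utf-8")).hexdigest()
    # The first six hex digits of the digest serve as the RGB components.
    return "#" + digest[:6].upper()


for label in ["abethr1", "abhori1", "abythr1"]:
    print(label, label_to_hex_color(label))
```

<p style="font-family: consolas; font-size: 16px;">⚪ The result could be used in place of the <code>color_map</code> lookup, so that a given label keeps the same color across runs.</p>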

<a id="3"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(51, 51, 51) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 3. Overview <i>eBird_Taxonomy_v2021.csv</i> file</b></div>

<p style="font-family: consolas; font-size: 16px;">🔴 This .csv file contains the taxonomic classification of each species. It can be used to identify relationships between different bird species, for example which species share an order or a family.</p>

<p style="font-family: consolas; font-size: 16px;"> Description of the columns:</p>

* <p style="font-family: consolas; font-size: 16px;"> <code>TAXON_ORDER</code>: A numeric value giving the position of the entry in the eBird taxonomic sequence (useful for sorting).</p>
* <p style="font-family: consolas; font-size: 16px;"> <code>CATEGORY</code>: The eBird category of the entry (e.g., <code>species</code> or <code>slash</code>).</p>
* <p style="font-family: consolas; font-size: 16px;"> <code>SPECIES_CODE</code>: A unique code assigned to each species.</p>
* <p style="font-family: consolas; font-size: 16px;"> <code>PRIMARY_COM_NAME</code>: The common name of the species.</p>
* <p style="font-family: consolas; font-size: 16px;"> <code>SCI_NAME</code>: The scientific name of the species.</p>
* <p style="font-family: consolas; font-size: 16px;"> <code>ORDER1</code>: The taxonomic order of the species.</p>
* <p style="font-family: consolas; font-size: 16px;"> <code>FAMILY</code>: The taxonomic family of the species.</p>
* <p style="font-family: consolas; font-size: 16px;"> <code>SPECIES_GROUP</code>: The taxonomic group that the species belongs to.</p>
* <p style="font-family: consolas; font-size: 16px;"> <code>REPORT_AS</code>: A code indicating how the species should be reported.</p>

In [26]:
ebt_df = pd.read_csv("/kaggle/input/birdclef-2023/eBird_Taxonomy_v2021.csv")

In [27]:
ebt_df.head()

Unnamed: 0,TAXON_ORDER,CATEGORY,SPECIES_CODE,PRIMARY_COM_NAME,SCI_NAME,ORDER1,FAMILY,SPECIES_GROUP,REPORT_AS
0,1,species,ostric2,Common Ostrich,Struthio camelus,Struthioniformes,Struthionidae (Ostriches),Ostriches,
1,6,species,ostric3,Somali Ostrich,Struthio molybdophanes,Struthioniformes,Struthionidae (Ostriches),,
2,7,slash,y00934,Common/Somali Ostrich,Struthio camelus/molybdophanes,Struthioniformes,Struthionidae (Ostriches),,
3,8,species,grerhe1,Greater Rhea,Rhea americana,Rheiformes,Rheidae (Rheas),Rheas,
4,14,species,lesrhe2,Lesser Rhea,Rhea pennata,Rheiformes,Rheidae (Rheas),,


<p style="font-family: consolas; font-size: 16px;"> ⚪ Let's check the number of rows in this dataframe.</p>

In [28]:
len(ebt_df)

16753

<a id="3.1"></a>
## <div style="box-shadow: rgba(0, 0, 0, 0.18) 0px 2px 4px inset; padding:20px; font-size:24px; font-family: consolas; text-align:center; display:fill; border-radius:15px; color:rgb(67, 66, 66)"> <b> 3.1 Check for missing data</b></div>

In [29]:
ebt_df.isnull().sum()

TAXON_ORDER             0
CATEGORY                0
SPECIES_CODE            0
PRIMARY_COM_NAME        0
SCI_NAME                0
ORDER1                  2
FAMILY                 13
SPECIES_GROUP       16537
REPORT_AS           12877
dtype: int64
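
<p style="font-family: consolas; font-size: 16px;">⚪ As a sketch of how the taxonomy could be used to relate species, the snippet below groups species codes by family. It uses only the five rows printed by <code>ebt_df.head()</code> above, reduced to three columns. Applying it to the training data would assume that <code>primary_label</code> values match <code>SPECIES_CODE</code>, which this notebook does not verify:</p>

```python
import csv
import io
from collections import defaultdict

# The five rows printed by ebt_df.head(), reduced to three columns.
sample_csv = """SPECIES_CODE,PRIMARY_COM_NAME,FAMILY
ostric2,Common Ostrich,Struthionidae (Ostriches)
ostric3,Somali Ostrich,Struthionidae (Ostriches)
y00934,Common/Somali Ostrich,Struthionidae (Ostriches)
grerhe1,Greater Rhea,Rheidae (Rheas)
lesrhe2,Lesser Rhea,Rheidae (Rheas)
"""

family_by_code = {}                   # species code -> family
codes_by_family = defaultdict(list)   # family -> list of species codes
for row in csv.DictReader(io.StringIO(sample_csv)):
    family_by_code[row["SPECIES_CODE"]] = row["FAMILY"]
    codes_by_family[row["FAMILY"]].append(row["SPECIES_CODE"])

print(family_by_code["ostric3"])            # Struthionidae (Ostriches)
print(codes_by_family["Rheidae (Rheas)"])   # ['grerhe1', 'lesrhe2']
```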

# <div style="box-shadow: rgba(240, 46, 170, 0.4) -5px 5px inset, rgba(240, 46, 170, 0.3) -10px 10px inset, rgba(240, 46, 170, 0.2) -15px 15px inset, rgba(240, 46, 170, 0.1) -20px 20px inset, rgba(240, 46, 170, 0.05) -25px 25px inset; padding:20px; font-size:30px; font-family: consolas; display:fill; border-radius:15px; color: rgba(240, 46, 170, 0.7)"> <b> ༼⁠ ⁠つ⁠ ⁠◕⁠‿⁠◕⁠ ⁠༽⁠つ Thank You!</b></div>

<p style="font-family:verdana; color:rgb(34, 34, 34); font-family: consolas; font-size: 16px;"> 💌 Thank you for taking the time to read through my notebook. I hope you found it interesting and informative. If you have any feedback or suggestions for improvement, please don't hesitate to let me know in the comments. <br><br> 🚀 If you liked this notebook, please consider upvoting it so that others can discover it too. Your support means a lot to me, and it helps to motivate me to create more content in the future. <br><br> ❤️ Once again, thank you for your support, and I hope to see you again soon!</p>