!["](https://storage.googleapis.com/kaggle-competitions/kaggle/25954/logos/header.png?t=2021-03-19-18-32-57)
> # Complete EDA🔎📊📈 for BirdCLEF 2021 - Birdcall Identification Challenge








---


# Competition Overview

Recent advances in machine listening have improved acoustic data collection. However, it remains a challenge to generate analysis outputs with high precision and recall. The majority of data is unexamined due to a lack of effective tools for efficient and reliable extraction of the signals of interest (e.g., bird calls).

In this competition, you’ll automate the acoustic identification of birds in soundscape recordings. You'll examine an acoustic dataset to build detectors and classifiers to extract the signals of interest (bird calls). Innovative solutions will be able to do so efficiently and reliably.

### **Task:-**
Your challenge in this competition is to identify which birds are calling in long recordings, given training data generated in meaningfully different contexts. For each row_id/time window, you need to provide a space delimited list of the set of unique birds that made a call beginning or ending in that time window. If there are no bird calls in a time window, use the code nocall.
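
The label format described above can be sketched as follows. This is only an illustration, not the competition's reference code: `format_birds` is a hypothetical helper name, and sorting the codes is an assumption made purely to get deterministic output.

```python
def format_birds(predicted_codes):
    """Build the space-delimited label string for one time window.

    `predicted_codes` is any iterable of species codes (hypothetical input).
    Duplicates are dropped; an empty prediction becomes 'nocall'.
    """
    unique_codes = sorted(set(predicted_codes))  # sorted only for stable output
    return " ".join(unique_codes) if unique_codes else "nocall"


print(format_birds(["amecro", "houspa", "amecro"]))  # amecro houspa
print(format_birds([]))                               # nocall
```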

***

# Data Description


#### **train_short_audio -**
The bulk of the training data consists of short recordings of individual bird calls generously uploaded by users of xenocanto.org. These files have been downsampled to 32 kHz where applicable to match the test set audio and converted to the ogg format. The training data should have nearly all relevant files; we expect there is no benefit to looking for more on xenocanto.org.

#### **train_soundscapes -** 
Audio files that are quite comparable to the test set. They are all roughly ten minutes long and in the ogg format. The test set also has soundscapes from the two recording locations represented here.

#### **test_soundscapes -**
When you submit a notebook, the test_soundscapes directory will be populated with approximately 80 recordings to be used for scoring. These will be roughly 10 minutes long and in ogg audio format. The file names include the date the recording was taken, which can be especially useful for identifying migratory birds.

This folder also contains text files with the name and approximate coordinates of the recording location plus a csv with the set of dates the test set soundscapes were recorded.

#### **test.csv -** Only the first three rows are available for download; the full test.csv is in the hidden test set.

**row_id:** ID code for the row.

**site:** Site ID.

**seconds:** the second ending the time window

**audio_id:** ID code for the audio file.

#### **train_metadata.csv -** A wide range of metadata is provided for the training data. The most directly relevant fields are:

**primary_label:** a code for the bird species. You can review detailed information about the bird codes by appending the code to https://ebird.org/species/, such as https://ebird.org/species/amecro for the American Crow.
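
As a small sketch of the lookup described above, the species page can be built by string concatenation. `ebird_url` is a hypothetical helper name introduced only for this example.

```python
EBIRD_SPECIES_URL = "https://ebird.org/species/"


def ebird_url(primary_label):
    """Return the eBird species page for a species code."""
    return EBIRD_SPECIES_URL + primary_label


print(ebird_url("amecro"))  # https://ebird.org/species/amecro
```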

**author:** the user who provided the recording.

**latitude & longitude:** coordinates for where the recording was taken. Some bird species may have local call 'dialects,' so you may want to seek geographic diversity in your training data.

**date:** while some bird calls can be made year round, such as an alarm call, some are restricted to a specific season. You may want to seek temporal diversity in your training data.

**filename:** the name of the associated audio file.

#### **train_soundscape_labels.csv -**

**row_id:** ID code for the row.

**site:** Site ID.

**seconds:** the second ending the time window

**audio_id:** ID code for the audio file.

**birds:** space delimited list of any bird songs present in the 5 second window. The label nocall means that no call occurred.

#### **sample_submission.csv -** A properly formed sample submission file. Only the first three rows are public, the remainder will be provided to your notebook as part of the hidden test set.

**row_id**

**birds:** space delimited list of any bird songs present in the 5 second window. If there are no bird calls, use the label nocall.


***

> ### **If you find this notebook useful, do give me an upvote.👍**

# **Let's Understand the Data**

## ***Train Metadata***
* It specifies the audible species for each recording.
* Consists of information like Primary Label, Secondary Label, Type, Location, Time & Date, Rating etc.

In [None]:
import pandas as pd
import numpy as np

meta = pd.read_csv('../input/birdclef-2021/train_metadata.csv')

In [None]:
meta.head(5)

In [None]:
meta.shape

In [None]:
meta.describe()

In [None]:
meta.isnull().sum()

In [None]:
meta.info()

### **Now, Let's explore each column**

### <font color='red'>Primary Label</font>
* A code for the bird species. 
* You can review detailed information about the bird codes by appending the code to https://ebird.org/species/, such as https://ebird.org/species/amecro for the American Crow.



In [None]:
import plotly.express as px
species = meta['primary_label'].value_counts()
# Pass index and values explicitly so the plot does not depend on the Series name
fig = px.bar(x=species.index, y=species.values, labels=dict(x="Species", y="Count"), title="Bird Species Count")
fig.show()

In [None]:
print(len(species))

In [None]:
cnt = 0
for i in species:
    if i > 300:
        cnt += 1
print(cnt)

* The metadata contains recordings of 397 different primary labels (species).
* As the graph shows, the training data is highly imbalanced, with a long-tailed class distribution.
* Out of 397 species, only 39 have more than 300 recordings.
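
The same kind of summary can be computed without a loop. The sketch below uses a toy label column rather than the real `meta['primary_label']`, and the max/min ratio is just one simple, assumed way to express imbalance.

```python
import pandas as pd

# Toy stand-in for meta['primary_label']; the real column has 397 species.
labels = pd.Series(["amecro"] * 5 + ["houspa"] * 3 + ["banana"] * 1)

counts = labels.value_counts()
threshold = 2  # the notebook uses 300 on the real data
n_above = int((counts > threshold).sum())

# Ratio between the most and least frequent class as a rough imbalance measure.
imbalance_ratio = counts.max() / counts.min()

print(n_above)          # 2
print(imbalance_ratio)  # 5.0
```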

### <font color = 'red'>Secondary Labels</font>
* Contains a list of eBird codes (the same codes used for primary labels) for the other species that recordists annotated.
* Can be used for Multi-label training.

In [None]:
meta['secondary_labels'].value_counts()

* In around 41358 rows, no secondary label is present.
* The majority of recordings do not have an annotation of background species.
* Yet, it is highly likely that most of them actually contain one or more additional species.
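
For multi-label training, primary and secondary labels can be combined into one target vector. The sketch below assumes the `secondary_labels` column stores the string form of a Python list (for example `"['houspa']"` or `"[]"`); `multi_hot` and the toy class list are hypothetical names used only for illustration.

```python
import ast


def multi_hot(primary_label, secondary_labels, classes):
    """Return a 0/1 target vector covering primary and secondary species.

    Assumes `secondary_labels` is the string form of a Python list,
    e.g. "['houspa']" or "[]" (an assumption about the CSV encoding).
    """
    present = {primary_label, *ast.literal_eval(secondary_labels)}
    return [1 if code in present else 0 for code in classes]


classes = ["amecro", "banana", "houspa"]  # toy class list
print(multi_hot("amecro", "['houspa']", classes))  # [1, 0, 1]
print(multi_hot("banana", "[]", classes))          # [0, 1, 0]
```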

### <font color = 'red'>Author</font>

In [None]:
meta['author'].nunique()

In [None]:
meta_exp = meta.groupby(['primary_label','author']).size()
meta_exp

There are 2129 distinct authors in total.

### <font color = 'red'>Location</font>

In [None]:
# Code adapted from: https://www.kaggle.com/andradaolteanu/birdcall-recognition-eda-and-audio-fe

import matplotlib.pyplot as plt
import seaborn as sns
import descartes
import geopandas as gpd
from shapely.geometry import Point, Polygon

# SHP file
world_map = gpd.read_file("../input/world-shapefile/world_shapefile.shp")

# Coordinate reference system (WGS 84 latitude/longitude)
crs = "EPSG:4326"

# Lat and Long need to be of type float, not object
species_list = ['norcar', 'houspa', 'wesblu', 'banana']
# .copy() avoids pandas' SettingWithCopyWarning when assigning columns below
data = meta[meta['primary_label'].isin(species_list)].copy()
data["latitude"] = data["latitude"].astype(float)
data["longitude"] = data["longitude"].astype(float)

# Create geometry
geometry = [Point(xy) for xy in zip(data["longitude"], data["latitude"])]

# Geo Dataframe
geo_df = gpd.GeoDataFrame(data, crs=crs, geometry=geometry)

# Create ID for species
species_id = geo_df["primary_label"].value_counts().reset_index()
species_id.insert(0, 'ID', range(0, 0 + len(species_id)))

species_id.columns = ["ID", "primary_label", "count"]

# Add ID to geo_df
geo_df = pd.merge(geo_df, species_id, how="left", on="primary_label")

# === PLOT ===
fig, ax = plt.subplots(figsize = (16, 10))
world_map.plot(ax=ax, alpha=0.4, color="grey")

palette = iter(sns.hls_palette(len(species_id)))
for i in range(len(species_list)):
    geo_df[geo_df["ID"] == i].plot(ax=ax, 
                                   markersize=20, 
                                   color=next(palette), 
                                   marker="o", 
                                   label = species_id['primary_label'].values[i]);
    
ax.legend()

Location data might be a good feature, since it is possible that certain bird species occur only in particular regions. For example, in the map above:

* The Bananaquit (banana) seems to occur only in Central and South America.
* The House Sparrow (houspa) has occurrences around the globe.
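
One rough way to quantify how widespread each species is would be a per-species bounding box over the recording coordinates. The sketch below uses made-up coordinates in place of `meta['latitude']` and `meta['longitude']`, so the numbers are illustrative only.

```python
import pandas as pd

# Toy coordinates; the real values come from meta['latitude'] / meta['longitude'].
toy = pd.DataFrame({
    "primary_label": ["banana", "banana", "houspa", "houspa"],
    "latitude": [10.0, -5.0, 51.5, -33.9],
    "longitude": [-84.0, -60.0, -0.1, 151.2],
})

# Min/max per species gives a rough bounding box of where it was recorded.
extent = toy.groupby("primary_label")[["latitude", "longitude"]].agg(["min", "max"])

# A large latitude span hints at a globally distributed species.
lat_span = extent[("latitude", "max")] - extent[("latitude", "min")]
print(lat_span)
```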

### <font color ='red'>Date</font>

While some bird calls can be made year round, such as an alarm call, some are restricted to a specific season. You may want to seek temporal diversity in your training data.
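
To look at seasonality, the month can be extracted with `pd.to_datetime`. The sketch below runs on a toy series rather than `meta['date']`; the incomplete date is a hypothetical example of a value that cannot be parsed, and `errors="coerce"` turns such entries into `NaT` instead of raising.

```python
import pandas as pd

# Toy stand-in for meta['date']; the malformed entry is a hypothetical example.
dates = pd.Series(["2019-05-17", "2003-00-00", "2020-12-01"])

# Invalid dates become NaT instead of raising an error.
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")
months = parsed.dt.month

print(int(parsed.isna().sum()))  # 1
print(months.dropna().astype(int).tolist())  # [5, 12]
```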

In [None]:
meta['year'] = meta['date'].apply(lambda x: x.split('-')[0])
meta['month'] = meta['date'].apply(lambda x: x.split('-')[1])
meta['day_of_month'] = meta['date'].apply(lambda x: x.split('-')[2])
meta.head(3)

In [None]:
patterns = pd.DataFrame()
patterns['time'] = pd.to_datetime(meta['time'], errors='coerce')
patterns = patterns.dropna(subset=['time']).reset_index().drop('index',axis=1)
patterns.time = patterns['time'].dt.hour.astype('int')

In [None]:
patterns.time

In [None]:
patterns_type = []
for t in patterns.time:
    if((t>4) and (t<12)):
        patterns_type.append(0)
    elif((t>=12) and (t<16)):
        patterns_type.append(1)
    elif((t>=16) and (t<19)):
        patterns_type.append(2)
    else:
        patterns_type.append(3)

In [None]:
patterns['type'] = patterns_type

In [None]:
patterns

In [None]:
b_types = patterns['type'].value_counts()
b_types

In [None]:
fig = px.bar(x=b_types.index, y=b_types.values, labels=dict(x="Time of day", y="Count"), title="Bird Call Time Distribution Graph")
fig.show()

Here,

0 : Morning

1 : Afternoon

2 : Evening

3 : Night
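
The hour-to-period mapping listed above can also be computed without an explicit loop. This is a sketch using `np.select` on a toy series of hours, with the same boundaries as the four codes (0 to 3); it is an alternative formulation, not the code that produced the chart.

```python
import numpy as np
import pandas as pd

hours = pd.Series([3, 5, 11, 12, 15, 16, 18, 19, 23])  # toy hours of day

conditions = [
    (hours > 4) & (hours < 12),    # 0: morning
    (hours >= 12) & (hours < 16),  # 1: afternoon
    (hours >= 16) & (hours < 19),  # 2: evening
]
# Everything else (19:00 to 04:59) falls through to 3: night.
period = np.select(conditions, [0, 1, 2], default=3)

print(period.tolist())  # [3, 0, 0, 1, 1, 2, 2, 3, 3]
```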

### <font color = 'red'>Rating</font>

In [None]:
print("Minimum Rating:",min(meta['rating']))
print("Maximum Rating:",max(meta['rating']))


In [None]:
ratings = meta['rating'].value_counts()
fig = px.bar(x=ratings.index, y=ratings.values, labels=dict(x="Rating", y="Count"), title="Rating Count")
fig.show()

Most of the recordings have a rating greater than 3.0, which is good. However, around 3.3k recordings have a rating of 0.

### <font color = 'red'>Type</font>

In [None]:
meta['type'].value_counts()

In [None]:
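# 'type' holds list-like strings; strip the brackets, split on commas, one row per call type
type_edit = meta['type'].str.strip('[]').str.split(',').reset_index().explode("type")
# Remove leftover quotes and spaces so identical call types are counted together
type_edit['type'] = type_edit['type'].str.strip(" '\"")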

In [None]:
type_edit['type'].value_counts()

In [None]:
# Take names and frequencies from the same value_counts result
top_counts = type_edit['type'].value_counts().head(10)
top_10 = list(top_counts.index)
top_10_freq = list(top_counts.values)
fig = px.bar(x=top_10, y=top_10_freq, labels=dict(x="Call type", y="Count"), title="Top 10 Call Types")
fig.show()

---

## **Train & Test Data**

In [None]:
train = pd.read_csv('../input/birdclef-2021/train_soundscape_labels.csv')
test = pd.read_csv('../input/birdclef-2021/test.csv')

In [None]:
train

In [None]:
train.info()

In [None]:
sites = train['site'].value_counts()
fig = px.bar(x=sites.index, y=sites.values, labels=dict(x="Site", y="Count"), title="Site Distribution")
fig.show()

Both sites are equally represented.

In [None]:
types_bird = train['birds'].value_counts()
fig = px.bar(x=types_bird.index, y=types_bird.values, labels=dict(x="Birds", y="Count"))
fig.show()

A large share of the 5-second windows in the recordings contain no bird call (nocall).
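
The share of nocall windows can be measured directly. The sketch below uses a toy series in place of `train['birds']`; it also splits multi-species windows so that each species is counted once per window.

```python
import pandas as pd

# Toy stand-in for train['birds'].
birds = pd.Series(["nocall", "nocall", "amecro", "amecro houspa", "nocall"])

# Fraction of windows labelled nocall.
nocall_share = (birds == "nocall").mean()

# Split multi-species windows so each species is counted once per window.
species_counts = birds[birds != "nocall"].str.split().explode().value_counts()

print(nocall_share)
print(species_counts)
```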

In [None]:
test

In [None]:
test.info()

Only the first three rows are available; the full test.csv is in the hidden test set.

---

# Let's Explore Audio Data

### *Librosa vs Scipy*
I chose librosa because it returns the samples as floating-point values normalized to the range -1 to 1 when loading an audio file, whereas scipy's WAV reader returns integer-encoded samples as raw integers.
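
To illustrate what this normalization amounts to, the sketch below scales made-up 16-bit samples by hand. Dividing by 2**15 is the common convention for 16-bit PCM and is an assumption here, not a statement about the exact internals of either library.

```python
import numpy as np

# Raw 16-bit PCM samples as an integer-based reader would return them.
raw = np.array([-32768, -16384, 0, 16384, 32767], dtype=np.int16)

# Scale by 2**15 so the values fall in [-1, 1), matching float-style loaders.
normalized = raw.astype(np.float32) / 32768.0

print(normalized.min(), normalized.max())
```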

In [None]:
import os
train_short_audio_path = '../input/birdclef-2021/train_short_audio'
audio_count = []
bird_species = []
for i in os.listdir(train_short_audio_path):
    bird_species.append(i)
    audio_cnt = len(os.listdir(train_short_audio_path + os.sep + f"{i}"))
    audio_count.append(audio_cnt)


In [None]:
fig = px.bar(x=bird_species, y=audio_count, title = " Audio Count Distribution")
fig.show()

In [None]:
# Read one sample file

import librosa
audio_file = '../input/birdclef-2021/train_short_audio/astfly/XC118723.ogg'
# librosa.load resamples to 22,050 Hz by default; pass sr=None to keep the file's native rate
x , sr = librosa.load(audio_file)

In [None]:
import IPython.display as play
play.Audio(audio_file)

An audio signal can be described along three axes: time, amplitude and frequency.
Models cannot work with raw audio files directly, so feature extraction is used to convert them into a representation they can learn from.
librosa.display can show audio in several forms, such as a waveform plot or a spectrogram.



#### <font color = 'orange'>Waveplots</font>
* Waveplots let us know the loudness of the audio at a given time.
* librosa.display.waveplot (replaced by waveshow in newer librosa releases) plots the waveform as amplitude against time.

In [None]:
import matplotlib.pyplot as plt
import librosa.display
plt.figure(figsize=(14, 5))
# waveplot was replaced by waveshow in newer librosa releases
librosa.display.waveshow(x, sr=sr)


#### <font color = 'orange'>Spectrogram</font>
* A spectrogram is a visual representation of the spectrum of frequencies of a sound or other signal as it varies with time.
* In other words, it shows how the frequency content of an audio signal changes over time.
* .stft() computes the short-time Fourier transform (STFT), which gives the amplitude of each frequency at each point in time.
* .specshow() is used to display the spectrogram.
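
The idea behind the STFT can be sketched with plain NumPy: cut the signal into overlapping frames, apply a window, and take an FFT of each frame. This is a simplified illustration on a synthetic tone, not librosa's implementation (for instance, it does no padding or centering); `simple_stft` and its parameter values are assumptions chosen for the example.

```python
import numpy as np


def simple_stft(signal, n_fft=512, hop_length=128):
    """Magnitude STFT via framing, a Hann window, and a real FFT per frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    frames = np.stack([
        signal[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    # Rows = frequency bins, columns = time frames (same layout as the plot).
    return np.abs(np.fft.rfft(frames, axis=1)).T


sr = 8000                            # toy sample rate in Hz
t = np.arange(sr) / sr               # one second of samples
tone = np.sin(2 * np.pi * 1000 * t)  # pure 1 kHz tone

S = simple_stft(tone)
# Convert the strongest frequency bin back to Hz.
peak_hz = S.mean(axis=1).argmax() * sr / 512

print(S.shape)  # (257, 59)
print(peak_hz)  # 1000.0
```

Each column of `S` is the spectrum of one short time slice, which is why a steady tone shows up as a horizontal line in a spectrogram.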

In [None]:
X = librosa.stft(x)
Xdb = librosa.amplitude_to_db(abs(X))
plt.figure(figsize=(14, 5))
librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')
plt.colorbar()

#### <font color = 'orange'>MFCC</font>
* MFCCs are among the most widely used features for audio signals.
* The mel frequency cepstral coefficients (MFCCs) of a signal are a small set of features (usually about 10–20) which concisely describe the overall shape of a spectral envelope.
* .mfcc() is used to calculate the MFCCs of a signal.
* The shape of the result tells you how many MFCCs were calculated over how many frames: the first value is the number of MFCCs and the second is the number of frames.
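
The expected shape can be worked out ahead of time. The sketch below assumes librosa's defaults of 20 coefficients and a hop of 512 samples with centered frames; `expected_mfcc_shape` is a hypothetical helper, and the clip length is made up.

```python
def expected_mfcc_shape(n_samples, n_mfcc=20, hop_length=512):
    """Shape (n_mfcc, n_frames) for a centered, padded framing scheme.

    With centered frames, one frame starts every `hop_length` samples,
    giving 1 + n_samples // hop_length frames in total.
    """
    return (n_mfcc, 1 + n_samples // hop_length)


# Ten seconds of audio at 22,050 Hz (a hypothetical clip length).
print(expected_mfcc_shape(10 * 22050))  # (20, 431)
```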

In [None]:
# Use the signal's actual sample rate; newer librosa releases require keyword arguments here
mfccs = librosa.feature.mfcc(y=x, sr=sr)
print(mfccs.shape)
plt.figure(figsize=(15, 7))
librosa.display.specshow(mfccs, sr=sr, x_axis='time')

#### <font color = 'orange'>Chromagram</font>
* A chromagram relates closely to the twelve pitch classes.
* Chroma-based features are also referred to as “pitch class profiles”.
* One main property of chroma features is that they capture harmonic and melodic characteristics of music while being robust to changes in timbre and instrumentation.
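
To make the notion of a pitch class concrete, the sketch below maps a frequency to the nearest of the twelve classes using the standard equal-temperament relation to A4 = 440 Hz. It illustrates the concept only and is not how a chromagram is computed internally; `pitch_class` is a hypothetical helper.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_class(freq_hz, a4_hz=440.0):
    """Map a frequency to its nearest equal-tempered pitch class."""
    midi_note = round(69 + 12 * math.log2(freq_hz / a4_hz))  # A4 = MIDI 69
    return PITCH_CLASSES[midi_note % 12]


print(pitch_class(440.0))   # A
print(pitch_class(880.0))   # A (one octave up, same pitch class)
print(pitch_class(261.63))  # C
```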

In [None]:
hop_length = 512  # librosa's default hop; a hop of only 12 samples would produce an enormous number of frames
chromagram = librosa.feature.chroma_stft(y=x, sr=sr, hop_length=hop_length)
plt.figure(figsize=(15, 5))
librosa.display.specshow(chromagram, x_axis='time', y_axis='chroma', hop_length=hop_length, cmap='coolwarm')

### <font color = 'green'>**References**</font>

* https://www.kaggle.com/andradaolteanu/birdcall-recognition-eda-and-audio-fe
* https://www.kaggle.com/stefankahl/birdclef2021-exploring-the-data


## This notebook will be updated frequently so keep checking for further developments.

## In case of any doubts reach out to me on [LinkedIn](https://www.linkedin.com/in/rajendra-sarpal-rs465/).