# BirdCLEF - Exploratory Data Analysis
This notebook walks through some preliminary EDA of the BirdCLEF data: inspecting the metadata, listening to the audio recordings, and running a few other analyses. By the end, we hope to have a better idea of which kinds of preprocessing will be helpful. The Kaggle competition stated that training data for this task is hard to come by, so we will see what that means in practice.

To recap, the goal of the project is to **predict whether a particular soundbite contains a specified bird call or not.** Therefore, the answer will be True/False. Another thing to keep in mind is that, in addition to the primary label, we also have background secondary labels, i.e. whether another bird call was present in the background of the primary bird's recording. This forces us to treat the problem as multi-label classification, where we predict the probability of each bird being present in each recording.
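
To make the multi-label framing concrete, here is a minimal sketch of how the primary and secondary labels could be combined into one 0/1 target column per bird. The function name `build_multi_hot_targets` and the toy rows are hypothetical and only illustrate the idea; they are not part of the competition code.

```python
import ast

import numpy as np
import pandas as pd


def build_multi_hot_targets(metadata: pd.DataFrame) -> pd.DataFrame:
    """Return one row per recording and one 0/1 column per bird species."""
    # secondary_labels is stored as a string such as "['houspa', 'redava']".
    secondary = metadata['secondary_labels'].apply(ast.literal_eval)
    # Every species that appears as either a primary or a secondary label.
    species = sorted(
        set(metadata['primary_label']) | {b for birds in secondary for b in birds}
    )
    targets = pd.DataFrame(0, index=metadata.index, columns=species, dtype=np.int8)
    for row, primary, birds in zip(metadata.index, metadata['primary_label'], secondary):
        # Mark the primary bird and every background bird as present.
        targets.loc[row, [primary, *birds]] = 1
    return targets


# Toy rows shaped like train_metadata (values are illustrative only).
toy_metadata = pd.DataFrame({
    'primary_label': ['afrsil1', 'afrsil1', 'houspa'],
    'secondary_labels': ["[]", "['houspa', 'redava']", "[]"],
})
toy_targets = build_multi_hot_targets(toy_metadata)
print(toy_targets)
```

Each row of the result can then serve as the target vector for one recording, with a 1 for every bird heard in it.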

## Import Packages

In [16]:
import os  # Needed for the os.path.join calls below
import numpy as np
import pandas as pd
import altair as alt

DATA_PATH = '../Data'
ALTAIR_JSONS = '../resources/altair_jsons/'
alt.renderers.enable('altair_viewer', inline=True)  # So that graphs appear in the notebook.

def get_json_filepath(filename):
    # Build the path of a JSON file inside the Altair resources folder.
    return os.path.join(ALTAIR_JSONS, filename)

## Reading Data
The audio files total around 6 GB. Instead of reading them all up front, we will collect their file paths and load each one as we need it. We will also read the metadata spreadsheet, which tells us which birds appear in each recording.

In [8]:
train_metadata = pd.read_csv(os.path.join(DATA_PATH, 'train_metadata.csv'))
train_metadata.head()

Unnamed: 0,primary_label,secondary_labels,type,latitude,longitude,scientific_name,common_name,author,license,rating,time,url,filename
0,afrsil1,[],"['call', 'flight call']",12.391,-1.493,Euodice cantans,African Silverbill,Bram Piot,Creative Commons Attribution-NonCommercial-Sha...,2.5,08:00,https://www.xeno-canto.org/125458,afrsil1/XC125458.ogg
1,afrsil1,"['houspa', 'redava', 'zebdov']",['call'],19.8801,-155.7254,Euodice cantans,African Silverbill,Dan Lane,Creative Commons Attribution-NonCommercial-Sha...,3.5,08:30,https://www.xeno-canto.org/175522,afrsil1/XC175522.ogg
2,afrsil1,[],"['call', 'song']",16.2901,-16.0321,Euodice cantans,African Silverbill,Bram Piot,Creative Commons Attribution-NonCommercial-Sha...,4.0,11:30,https://www.xeno-canto.org/177993,afrsil1/XC177993.ogg
3,afrsil1,[],"['alarm call', 'call']",17.0922,54.2958,Euodice cantans,African Silverbill,Oscar Campbell,Creative Commons Attribution-NonCommercial-Sha...,4.0,11:00,https://www.xeno-canto.org/205893,afrsil1/XC205893.ogg
4,afrsil1,[],['flight call'],21.4581,-157.7252,Euodice cantans,African Silverbill,Ross Gallardy,Creative Commons Attribution-NonCommercial-Sha...,3.0,16:30,https://www.xeno-canto.org/207431,afrsil1/XC207431.ogg
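
The "get the file paths and read as needed" idea can be sketched as below: the `filename` column is joined onto an audio folder, and no audio is loaded yet. The folder name `train_short_audio` and the helper `add_audio_paths` are assumptions for illustration; adjust them to the actual layout under the data folder.

```python
import os

import pandas as pd

# Assumption: the clips sit in a subfolder of the data directory. The folder
# name here is hypothetical and should be changed to match what is on disk.
AUDIO_DIR = os.path.join('../Data', 'train_short_audio')


def add_audio_paths(metadata: pd.DataFrame, audio_dir: str = AUDIO_DIR) -> pd.DataFrame:
    """Return a copy of the metadata with a 'filepath' column added."""
    out = metadata.copy()
    # filename looks like 'afrsil1/XC125458.ogg', so split on '/' to stay
    # portable across operating systems.
    out['filepath'] = [
        os.path.join(audio_dir, *name.split('/')) for name in out['filename']
    ]
    return out


# Toy example using a filename taken from the table above.
toy_files = pd.DataFrame({'filename': ['afrsil1/XC125458.ogg']})
toy_paths = add_audio_paths(toy_files, audio_dir='audio')
print(toy_paths['filepath'][0])
```

Only the paths are stored, so each clip can be opened later, one at a time, when it is actually needed.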


## Bird Frequencies
We will do a couple of things here: see how many recordings we have for each bird, and how varied our secondary labels are. We will use Altair for plotting, so the data behind each chart is saved as JSON in the resources/altair_jsons folder, which keeps plotting quick.
### Primary Label Counts

In [18]:
# Count up the primary labels
unique_labels, counts = np.unique(train_metadata.primary_label, return_counts=True)
# Create the dataframe and the JSON file
df_filepath = get_json_filepath('bird_counts.json')
# Build the columns separately so that Count keeps its integer dtype
df = pd.DataFrame({'Bird': unique_labels, 'Count': counts})
df.to_json(df_filepath, orient='records')

# Plot it!
chart = alt.Chart(df_filepath).mark_bar().encode(
    x='Bird:N',
    y='Count:Q'
)
chart.display()

The y-axis runs from 0 to 500. The counts vary heavily: some birds have 500 recordings, while others are in the single digits. This will make it hard to train on the low-frequency birds. What about the secondary labels? How do those vary?
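
Before moving on, the imbalance in the primary labels can also be put into numbers rather than read off the chart. The sketch below is one possible summary; the helper `summarize_label_counts` and the `rare_threshold` cut-off are hypothetical choices, and the toy labels stand in for `train_metadata.primary_label`.

```python
import pandas as pd


def summarize_label_counts(labels: pd.Series, rare_threshold: int = 10) -> dict:
    """Summarize how unevenly the recordings are spread across species."""
    counts = labels.value_counts()
    return {
        'n_species': int(counts.size),
        'min': int(counts.min()),
        'median': float(counts.median()),
        'max': int(counts.max()),
        # Species with fewer recordings than the (arbitrary) threshold.
        'n_rare': int((counts < rare_threshold).sum()),
    }


# Toy labels: 'a' has 5 recordings, 'b' has 2, and 'c' has 1.
toy_labels = pd.Series(['a'] * 5 + ['b'] * 2 + ['c'])
toy_summary = summarize_label_counts(toy_labels, rare_threshold=3)
print(toy_summary)
```

Calling it on the real `primary_label` column would show how many species fall below whatever threshold is considered too few to train on.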
### Secondary Label Counts
We need a little more preprocessing to get the secondary label counts.

In [36]:
# The secondary label column is an array of which birds are in the background. 
# We can just join these arrays together and do the same counts.
# Unfortunately, each entry in the column isn't a list, it's a string, so we need to parse it...
secondary_label_lists = train_metadata.secondary_labels.str.strip('][')
# Get rid of the empty strings
secondary_label_lists = secondary_label_lists[secondary_label_lists != '']
# Convert to list of lists
secondary_label_lists = secondary_label_lists.str.split(', ').to_list()
# Flatten the lists and strip the single quotes that still wrap each bird name
secondary_label_lists = [bird[1:-1] for sublist in secondary_label_lists for bird in sublist]

print(f'There are {len(secondary_label_lists)} background bird calls.')
print(secondary_label_lists[:5])

There are 1922 background bird calls.
['houspa', 'redava', 'zebdov', 'apapan', 'warwhe1']
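
As an alternative to the manual string stripping above, the standard library's `ast.literal_eval` can turn each string directly into a Python list. This is a sketch of that swap, not the method used in this notebook; `flatten_label_strings` is a hypothetical helper and the toy series mimics the `secondary_labels` column.

```python
import ast

import pandas as pd


def flatten_label_strings(label_strings: pd.Series) -> list:
    """Parse strings such as "['houspa', 'redava']" and flatten the result."""
    # literal_eval safely evaluates each string as a Python literal (a list).
    parsed = label_strings.apply(ast.literal_eval)
    # Empty lists contribute nothing, so no separate filtering step is needed.
    return [bird for birds in parsed for bird in birds]


# Toy column shaped like train_metadata.secondary_labels.
toy_secondary = pd.Series(["[]", "['houspa', 'redava', 'zebdov']", "['apapan']"])
flat_labels = flatten_label_strings(toy_secondary)
print(flat_labels)
```

This avoids depending on the exact `', '` separator and quote style of the stored strings.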


In [37]:
# Next, get the unique birds and counts as before and plot them...
unique_labels, counts = np.unique(secondary_label_lists, return_counts=True)
# Create dataframe and the json file
df_filepath = get_json_filepath('secondary_bird_counts.json')
# Build the columns separately so that Count keeps its integer dtype
df = pd.DataFrame({'Background Bird': unique_labels, 'Count': counts})
df.to_json(df_filepath, orient='records')

# Plot it!
chart = alt.Chart(df_filepath).mark_bar().encode(
    x='Background Bird:N',
    y='Count:Q'
)
chart.display()

As with the primary labels, the counts vary here too, although the maximum is much lower because not every recording has a background bird call. Given both graphs, we will eventually need some form of data augmentation to increase our training data. Thankfully, since the primary and secondary label distributions have a similar shape, we shouldn't need to change our approach depending on the type of call.
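
As a rough idea of what such augmentation might look like, the sketch below shifts a waveform in time and adds a little Gaussian noise. It is only an illustration under assumptions: the function `augment_waveform`, its default parameters, and the 32 kHz toy sine wave are all hypothetical, and the augmentation actually used later may differ.

```python
import numpy as np


def augment_waveform(waveform, rng, max_shift_fraction=0.2, noise_std=0.005):
    """Return a randomly time-shifted, slightly noisy copy of a 1-D waveform."""
    # Circularly shift the clip by up to max_shift_fraction of its length.
    max_shift = int(len(waveform) * max_shift_fraction)
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.roll(waveform, shift)
    # Add low-level Gaussian noise to mimic different recording conditions.
    noise = rng.normal(0.0, noise_std, size=waveform.shape)
    return (shifted + noise).astype(waveform.dtype)


# Toy input: one second of a 440 Hz tone at an assumed 32 kHz sample rate.
rng = np.random.default_rng(0)
sample_rate = 32000
toy_waveform = np.sin(2 * np.pi * 440 * np.arange(sample_rate) / sample_rate).astype(np.float32)
augmented = augment_waveform(toy_waveform, rng)
print(augmented.shape, augmented.dtype)
```

Applying a transformation like this several times to recordings of rare birds would produce extra, slightly different training examples for them.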