

<h1><center>🦉BirdCLEF2022🦉</center></h1>

# 1. Introduction

Hello everyone! This is our exploratory data analysis of BirdCLEF-2022. The approach and layout borrow from Andrada's notebook:
https://www.kaggle.com/code/andradaolteanu/birdcall-recognition-eda-and-audio-fe/notebook



### Libraries 📚⬇

In [None]:
import os

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.image as mpimg
from matplotlib.offsetbox import AnnotationBbox, OffsetImage


import folium

# Map 1 library
import plotly.express as px

# Map 2 libraries
import descartes
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Librosa Libraries
import librosa
import librosa.display
import IPython.display as ipd

import sklearn

import warnings
warnings.filterwarnings('ignore')

# 2. What do the CSV files tell us? 📁

## 2.1 train_metadata.csv

> 📌**Note**:
* `train_metadata.csv` holds the metadata for `train_audio`: 14,852 samples, 13 feature columns, and 152 bird species in total.

* <div class="特征列分析"> 
    <b>Feature columns</b><br>
    In the training set, each sample has the following features:<br>
    <code>primary_label</code> - (main bird call), <code>secondary_labels</code> - (secondary bird calls);<br>
    <code>type</code> - (call type/content);<br>
    <code>latitude</code>, <code>longitude</code> - (latitude and longitude of the recording);<br>
    <code>scientific_name</code>, <code>common_name</code> - (scientific and common name of the bird); <code>author</code> - (recording author);<br>
    <code>license</code>, <code>rating</code>, <code>time</code>, <code>url</code> - (license, recording rating, recording time, URL);<br>
    <code>filename</code> - (file name);<br>
</div>

In [None]:
train_csv=pd.read_csv("../input/birdclef-2022/train_metadata.csv")
train_csv.head(3)
#metadata = train_csv['primary_label'].reset_index().explode("primary_label")
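
The figures quoted in the note (sample count, column count, number of species) can be re-checked directly from the loaded frame. Below is a minimal sketch that uses a tiny made-up frame `toy` as a stand-in for `train_csv`, since the competition files are only available on Kaggle. The helper name `summarize` is hypothetical.

```python
import pandas as pd

# Toy stand-in for train_csv (values are made up for illustration only).
toy = pd.DataFrame({
    "primary_label": ["skylar", "houfin", "skylar", "iiwi"],
    "rating": [4.0, 3.5, 5.0, 0.0],
})

def summarize(df, label_col="primary_label"):
    """Return (number of rows, number of columns, number of distinct labels)."""
    n_rows, n_cols = df.shape
    return n_rows, n_cols, int(df[label_col].nunique())

print(summarize(toy))
```

On the real metadata, `summarize(train_csv)` should reproduce the counts listed in the note if they are accurate.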


## 2.2 test.csv
* `test.csv` describes the test set. Only 3 rows are visible; the rest are in the hidden test set.

In [None]:
test_csv = pd.read_csv("../input/birdclef-2022/test.csv")
test_csv

## 2.3 eBird_Taxonomy_v2021.csv
* This is the taxonomy file for the birds in the samples.
* Take the African Silverbill as an example.


In [None]:
df = pd.read_csv("../input/birdclef-2022/eBird_Taxonomy_v2021.csv")
df[df['PRIMARY_COM_NAME'].isin(['African Silverbill'])]

* TAXON_ORDER - 30031 (taxonomic sort number)
* CATEGORY - species
* SCI_NAME - Euodice cantans https://es.wikipedia.org/wiki/Euodice_cantans
* ORDER1 - Passeriformes (passerines)
* FAMILY - Estrildidae (estrildid finches)

## 2.4 scored_bird.json
#### Bird species that will appear in the test set

In [None]:
scored_bird = ["akiapo", "aniani", "apapan", "barpet", "crehon", "elepai", "ercfra",
               "hawama", "hawcre", "hawgoo", "hawhaw", "hawpet1", "houfin", "iiwi",
               "jabwar", "maupar", "omao", "puaioh", "skylar", "warwhe1", "yefcan"]
#a=['aniani']
# To analyze all birds, keep the following line commented out; uncomment it to analyze only the scored species
#train_csv = train_csv[train_csv["primary_label"].isin(scored_bird)]

#print(train_csv)
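
If the scored species are read from the JSON file rather than hard-coded, the filtering step might look like the sketch below. It assumes the file holds a flat JSON array of species codes (an assumption, not verified here). To run anywhere, it uses an inline string and made-up rows instead of the real files; `keep_scored` is a hypothetical helper.

```python
import json

# Inline stand-in for the JSON file content (assumed to be a flat array of codes).
scored_json = '["akiapo", "aniani", "skylar"]'
scored = set(json.loads(scored_json))

# Made-up metadata rows; only the primary_label field matters here.
rows = [
    {"primary_label": "skylar", "filename": "a.ogg"},
    {"primary_label": "mallar3", "filename": "b.ogg"},
    {"primary_label": "aniani", "filename": "c.ogg"},
]

def keep_scored(records, scored_codes):
    """Keep only records whose primary label is one of the scored species."""
    return [r for r in records if r["primary_label"] in scored_codes]

filtered = keep_scored(rows, scored)
print([r["filename"] for r in filtered])
```

Using a `set` for the scored codes keeps each membership check constant-time, which matters little for 21 species but costs nothing.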

## 2.1 Let's look at the song content 🎼



**Type Column**:

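> 📌**Call type/content**: this feature is hard to categorize on its own, because the same call type appears in several forms:
* **alarm call** appears as: alarm call | alarm call, call 
* **flight call** appears as: flight call | call, flight call etc.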

In [None]:
# Explode the `type` column so that every call type gets its own row
adjusted_type = train_csv['type'].apply(lambda x: x[1:-1].split(',')).reset_index().explode("type")

# Strip whitespace and leftover quote characters, then convert to lower case
adjusted_type = adjusted_type['type'].apply(lambda x: x.strip(" '\"").lower()).reset_index()
#adjusted_type['type'] = adjusted_type['type'].replace('calls', 'call')

# Create Top 15 list with song types
# (.index.tolist() avoids relying on the column name produced by reset_index(), which differs between pandas versions)
top_15 = adjusted_type['type'].value_counts().head(15).index.tolist()
data = adjusted_type[adjusted_type['type'].isin(top_15)]

# === PLOT ===

plt.figure(figsize=(16, 6))
ax = sns.countplot(x=data['type'], palette="hls", order=data['type'].value_counts().index)

plt.title("Top 15 Song Types", fontsize=16)
plt.ylabel("Frequency", fontsize=14)
plt.yticks(fontsize=13)
plt.xticks(rotation=45, fontsize=13)
plt.xlabel("");
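
A hedged alternative to slicing off the brackets and splitting on commas is to parse each `type` string with `ast.literal_eval`, which also copes with quotes or commas inside a label. The sample strings below are made up, and the fallback for unparsable values is an assumption about how one might want to treat them. `parse_types` is a hypothetical helper.

```python
import ast
from collections import Counter

# Made-up examples in the same shape as the `type` column.
raw_types = ["['alarm call', 'call']", "['Flight call']", "['call', 'flight call']", "song"]

def parse_types(value):
    """Turn one raw `type` string into a list of lower-cased, stripped labels."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        parsed = [value]  # assumption: treat unparsable text as a single label
    if isinstance(parsed, str):
        parsed = [parsed]
    return [str(item).strip().lower() for item in parsed]

type_counts = Counter(label for value in raw_types for label in parse_types(value))
print(type_counts.most_common(3))
```

The resulting `Counter` plays the same role as `value_counts()` in the cell above, so the top-15 selection could be taken from `type_counts.most_common(15)`.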

## Audio sample counts per species


In [None]:
# Create a new variable type by exploding all the values
metadata = train_csv['primary_label'].reset_index().explode("primary_label")
#metadata
# Create a list of the 8 most frequent species (the variable keeps its original name)
top_15 = metadata['primary_label'].value_counts().head(8).index.tolist()
#print(top_15)
data = metadata[metadata['primary_label'].isin(top_15)]  # rows that belong to the most frequent species

print(data['primary_label'].value_counts())
#print(data)
# === PLOT ===
# Is this plot necessary? It at least gives a rough view of the overall distribution
#print(len(metadata["primary_label"].value_counts()))
plt.figure(figsize=(16, 6))
ax = sns.countplot(x = metadata['primary_label'], palette="hls", order = metadata['primary_label'].value_counts().index)

plt.title("Label Counts = "+ str(len(metadata["primary_label"].value_counts())), fontsize=16)
plt.ylabel("Frequency", fontsize=14)
plt.yticks(fontsize=13)
plt.xticks()
#plt.xlabel("");

### Of course, we usually care most about the classes with few samples
#### Do we need to augment these under-represented classes?

In [None]:
# Create a list of the 15 least frequent species
least_15 = metadata['primary_label'].value_counts().tail(15).index.tolist()
#print(least_15)
data = metadata[metadata['primary_label'].isin(least_15)]  # rows that belong to the 15 rarest species
print(data['primary_label'].value_counts())
# === PLOT ===
plt.figure(figsize=(16, 6))
ax = sns.countplot(x = data['primary_label'], palette="hls", order = data['primary_label'].value_counts().index)

plt.title("Data Counts", fontsize=16)
plt.ylabel("Frequency", fontsize=14)
plt.yticks(fontsize=13)
plt.xticks(rotation=45, fontsize=13)
plt.xlabel("");
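
One way to approach the augmentation question is to quantify how far each rare class falls below a target sample count, which gives a rough oversampling or augmentation factor. Below is a minimal sketch with made-up counts. The threshold `target` and the helper `augmentation_factors` are hypothetical choices, not something prescribed by the competition.

```python
import math
from collections import Counter

# Made-up labels standing in for train_csv["primary_label"].
labels = ["skylar"] * 40 + ["houfin"] * 25 + ["crehon"] * 2 + ["maupar"] * 1

def augmentation_factors(label_list, target=10):
    """For each class with fewer than `target` samples, return how many
    variants of each clip would be needed to reach at least `target`."""
    counts = Counter(label_list)
    return {
        label: math.ceil(target / n)
        for label, n in counts.items()
        if n < target
    }

factors = augmentation_factors(labels)
print(factors)
```

Classes that already meet the target are left out of the result, so an empty dictionary would suggest that no augmentation is needed at that threshold.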

## Audio quality

In [None]:
# Collect the rating of every recording
data_rating = train_csv.loc[:,["primary_label","rating"]].reset_index().explode("rating")
#print(data_rating)
least_15 = data_rating['rating'].value_counts().tail(15).index.tolist()
#print(top_15)
data = data_rating[data_rating['rating'].isin(least_15)]  # rows with the least frequent rating values (not used in the plot below)
#print(data_rating.loc[data_rating['rating']==0, :])
# === PLOT ===
plt.figure(figsize=(16, 6))
ax = sns.countplot(x = data_rating['rating'], palette="hls")

plt.title("Data Rating Counts", fontsize=16)
plt.ylabel("Frequency", fontsize=14)
plt.yticks(fontsize=13)
plt.xticks(rotation=45, fontsize=13)
plt.xlabel("");

## Recording time

In [None]:
# Extract the hour of day from the recording time
data_rating = train_csv['time'].reset_index().explode('time')
freq = '60min'
#print(len(data_rating))
data_rating['time']= pd.to_datetime(data_rating['time'],errors = 'coerce')
data_rating['time'] = data_rating['time'].dt.floor(freq)
data_rating['time'] = data_rating['time'].dt.hour
#print(len(data_rating["time"]))
#print(data_rating)
#least_15 = list(data_rating['time'].value_counts().tail(15).reset_index()['index'])
#print(top_15)
#data = data_rating[data_rating['time'].isin(least_15)]  # rows that belong to least_15
#print(data_rating.loc[data_rating['time']==0, :])
# === PLOT ===
plt.figure(figsize=(16, 6))
ax = sns.countplot(x = data_rating['time'], palette="hls")

plt.title("Recordings per Hour of Day", fontsize=16)
plt.ylabel("Counts", fontsize=14)
plt.yticks(fontsize=13)
plt.xticks()
plt.xlabel("hours");
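
Calling `pd.to_datetime` without a format lets pandas guess how to read each string. If the `time` column is mostly `HH:MM` text (an assumption about the data), passing an explicit format makes the parsing rule visible and turns everything else into missing values. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up examples: two valid clock times and two unusable entries.
times = pd.Series(["06:30", "?", "17:05", "am"])

# Anything that does not match HH:MM becomes NaT, then NaN after .dt.hour.
hours = pd.to_datetime(times, format="%H:%M", errors="coerce").dt.hour

print(hours.value_counts().sort_index())
print("unparsed:", int(hours.isna().sum()))
```

Reporting the number of unparsed entries alongside the histogram shows how many recordings are missing from the hourly plot.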

## 2.3 Take a closer look at your birds? 📸🔭



## World View of the Species 🧭🌏



### 2.3.1 Where are our birds? 🦜
#### 2.3.1.1 Full dataset, grouped by order (17 orders in total)

In [None]:
# SHP file
world_map = gpd.read_file("../input/world-shapefile/world_shapefile.shp")

# Coordinate reference system (the {"init": ...} dict form is deprecated)
crs = "EPSG:4326"
train_species_csv=pd.read_csv("../input//2022-birdcleftrain-data-with-species//train_metadata_with_species.csv")
# Please delete the next line
# This is the line: train_species_csv= train_species_csv[train_species_csv["primary_label"].isin(scored_bird)]

# Lat and Long need to be of type float, not object
# (.copy() avoids pandas' SettingWithCopyWarning when the columns are overwritten)
data = train_species_csv[train_species_csv["latitude"] != "Not specified"].copy()
data["latitude"] = data["latitude"].astype(float)
data["longitude"] = data["longitude"].astype(float)

# Create geometry
geometry = [Point(xy) for xy in zip(data["longitude"], data["latitude"])]

# Geo Dataframe
geo_df = gpd.GeoDataFrame(data, crs=crs, geometry=geometry)

# Create ID for each order
species_id = geo_df["order1"].value_counts().reset_index()
species_id.insert(0, 'ID', range(0, 0 + len(species_id)))

species_id.columns = ["ID", "order1", "count"]
print(species_id)
# Add ID to geo_df
geo_df = pd.merge(geo_df, species_id, how="left", on="order1")

# === PLOT ===
fig, ax = plt.subplots(figsize = (16, 10))
world_map.plot(ax=ax, alpha=0.4, color="grey")

palette = iter(sns.hls_palette(len(species_id)))

for i in range(len(species_id)):  # one colour per order, without hard-coding the count
    geo_df[geo_df["ID"] == i].plot(ax=ax, markersize=20, color=next(palette), marker="o", label = "test");
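
The string comparison against `"Not specified"` only removes that one placeholder. A slightly more defensive sketch, assuming other non-numeric placeholders may occur, converts both columns with `pd.to_numeric(..., errors="coerce")` and drops rows where either coordinate is missing or out of range. The frame and the helper name `clean_coordinates` are made up for illustration.

```python
import pandas as pd

# Made-up rows: one valid pair, one placeholder pair, one missing longitude,
# and one latitude outside the valid range.
toy = pd.DataFrame({
    "latitude": ["19.5", "Not specified", "21.3", "95.0"],
    "longitude": ["-155.6", "Not specified", None, "10.0"],
})

def clean_coordinates(df):
    """Return a copy with numeric, in-range latitude/longitude only."""
    out = df.copy()
    for col in ("latitude", "longitude"):
        # Non-numeric text and missing values both become NaN.
        out[col] = pd.to_numeric(out[col], errors="coerce")
    out = out.dropna(subset=["latitude", "longitude"])
    in_range = out["latitude"].between(-90, 90) & out["longitude"].between(-180, 180)
    return out[in_range]

clean = clean_coordinates(toy)
print(clean)
```

The cleaned frame can then be passed to the geometry step in place of `data`.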

#### 2.3.1.2 Full dataset, grouped by family (41 families in total)

In [None]:
# SHP file
world_map = gpd.read_file("../input/world-shapefile/world_shapefile.shp")

# Coordinate reference system (the {"init": ...} dict form is deprecated)
crs = "EPSG:4326"
#train_species_csv=pd.read_csv("../input//2022-birdcleftrain-data-with-species//train_metadata_with_species.csv")
# Lat and Long need to be of type float, not object
# (.copy() avoids pandas' SettingWithCopyWarning when the columns are overwritten)
data = train_species_csv[train_species_csv["latitude"] != "Not specified"].copy()
data["latitude"] = data["latitude"].astype(float)
data["longitude"] = data["longitude"].astype(float)

# Create geometry
geometry = [Point(xy) for xy in zip(data["longitude"], data["latitude"])]

# Geo Dataframe
geo_df = gpd.GeoDataFrame(data, crs=crs, geometry=geometry)

# Create ID for each family
species_id = geo_df["family"].value_counts().reset_index()
species_id.insert(0, 'ID', range(0, 0 + len(species_id)))

species_id.columns = ["ID", "family", "count"]
print(species_id)
# Add ID to geo_df
geo_df = pd.merge(geo_df, species_id, how="left", on="family")

# === PLOT ===
fig, ax = plt.subplots(figsize = (16, 10))
world_map.plot(ax=ax, alpha=0.4, color="grey")

palette = iter(sns.hls_palette(len(species_id)))

for i in range(len(species_id)):  # one colour per family, without hard-coding the count
    geo_df[geo_df["ID"] == i].plot(ax=ax, markersize=20, color=next(palette), marker="o", label = "test");

## 3.3 Listening to some Recordings



### Ok, let's hear some songs! 🕊🎶

In [None]:
# Mixed audio
ipd.Audio("../input/join-voice-of-birdclef/joinVoice.ogg")

# Work in Progress ... ⏳