# Scraping Xeno Canto

This notebook scrapes the metadata for the birds in our dataset using the Xeno Canto API.

In [1]:
import itertools
import json
from pathlib import Path

import pandas as pd

from typing import TypeAlias
from collections.abc import Mapping, Sequence

JSON: TypeAlias = Mapping[str, "JSON"] | Sequence["JSON"] | str | int | float | bool | None

TRAIN_METADATA_PATH = Path("../data/processed/train_metadata.csv")

## Attempt 1: Retrieving all the ids one by one

According to the [Xeno Canto API documentation](https://xeno-canto.org/help/search), we can retrieve the metadata for a single recording id.
However, there seems to be no way to retrieve the metadata for multiple ids in one request.
This means that we would have to query the API once per recording.

This approach turned out not to be feasible: there seems to be a hard server-side limit on the number of requests that can be made to the Xeno Canto API, and the cell below was interrupted before it finished.

In [None]:
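# This cell also needs: from urllib import error, request

# Get the ids of all recordings in the dataset (the id is the last segment of the recording url)
# ids = pd.read_csv(TRAIN_METADATA_PATH)["url"].str.split("/").str.get(-1).astype(int).to_list()

# Retrieve the metadata for one recording, save the raw JSON response to disk
# and return the first recording in the response (or an empty dict on failure)
# def get_metadata(i: int) -> dict: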
#     url = f"https://xeno-canto.org/api/2/recordings?query=nr:{i}"
#     try:
#         response = request.urlopen(url)
#         response_json = json.loads(response.read().decode('UTF-8'))
#         file_path = Path(f"../data/download/metadata/{i}.json")
#         with open(file_path, "w") as f:
#             json.dump(response_json, f)
#         recordings = response_json["recordings"]
#         if not recordings:
#             return {}
#         return recordings[0]
#     except error.HTTPError as e:
#         print(f"Error retrieving metadata for recording {i}: {e}")
#         return {}

# Get metadata for the recordings
# metadata = [get_metadata(i) for i in ids]
# metadata[:5]

KeyboardInterrupt: 

## Attempt 2: Retrieving all the metadata per species

According to the [Xeno Canto API documentation](https://xeno-canto.org/help/search), we can retrieve the metadata for all recordings of a single species.
This returns more metadata than we need, but it takes far fewer requests to the Xeno Canto API.
We can filter the metadata down to the recordings in our dataset later on.
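
To make the per-species idea concrete, here is a minimal sketch of how such a paged query could be written by hand. It is not what this notebook runs (the cell below relies on the `xenocanto` package instead). The endpoint is the one used in Attempt 1, and `numPages` and `recordings` are the response fields listed later in this notebook. The `page` query parameter and the helper names `build_url`, `fetch_json` and `iter_recordings` are assumptions made for illustration.

```python
import json
from collections.abc import Callable, Iterator
from urllib import parse, request

API_URL = "https://xeno-canto.org/api/2/recordings"  # endpoint taken from Attempt 1


def build_url(species_name: str, page: int = 1) -> str:
    """Build the URL for one page of results for a species (assumes a `page` parameter)."""
    return f"{API_URL}?{parse.urlencode({'query': species_name, 'page': page})}"


def fetch_json(url: str) -> dict:
    """Default fetcher: performs a real HTTP request and decodes the JSON body."""
    with request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def iter_recordings(
    species_name: str, fetch: Callable[[str], dict] = fetch_json
) -> Iterator[dict]:
    """Yield every recording of a species, following pages until `numPages` is reached."""
    page = 1
    while True:
        payload = fetch(build_url(species_name, page))
        yield from payload["recordings"]
        if page >= int(payload["numPages"]):
            break
        page += 1
```

Passing the fetcher as an argument keeps the paging logic testable without touching the network, which matters given the request limit described above.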

In [7]:
species = pd.read_csv(TRAIN_METADATA_PATH)["scientific_name"].unique()
len(species)

182

In [32]:
# DO NOT RUN THIS CELL IF YOU ALREADY HAVE THE METADATA JSON FILES

# Retrieves metadata for requested recordings in the form of a JSON file
import xenocanto

# Get metadata for the recordings
for name in species:
    xenocanto.metadata([name])

Retrieving metadata...
Downloading metadata page 1...
Downloading metadata page 2...
Metadata retrieval complete.
Retrieving metadata...
Downloading metadata page 1...
Metadata retrieval complete.
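
Since the download cell should not be re-run once the JSON files exist, one option is to skip species that already have files on disk. The sketch below assumes the layout that `merge_pages` reads later in this notebook (one folder per species under `./dataset/metadata/`, named after the scientific name with spaces removed). The helper `pending_species` is hypothetical and not part of the `xenocanto` package.

```python
from pathlib import Path


def pending_species(species: list[str], root: Path) -> list[str]:
    """Return the species that have no downloaded JSON page under `root` yet."""
    return [
        name
        for name in species
        # A species counts as downloaded if its folder holds at least one JSON file.
        if not any((root / name.replace(" ", "")).glob("*.json"))
    ]
```

The download loop could then iterate over `pending_species(list(species), Path("./dataset/metadata/"))` instead of over `species` directly.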

In [3]:
# class XenoCantoAPIRecording(TypedDict, total=False):
#     id: int
#     gen: str
#     sp: str
#     ssp: str
#     group: str
#     en: str
#     rec: str
#     cnt: str
#     loc: str
#     lat: float
#     lng: float
#     alt: int
#     type: str
#     sex: str
#     stage: str
#     method: str
#     url: str
#     file: str
#     file_name: str
#     sono: Mapping[str, str]
#     osci: Mapping[str, str]
#     lic: str
#     q: float
#     length: str
#     time: str
#     date: str
#     uploaded: str
#     also: Sequence[str]
#     rmk: str
#     bird_seen: bool
#     animal_seen: bool
#     playback_used: bool
#     temp: str
#     regnr: str
#     auto: str
#     dvc: str
#     mic: str
#     smp: int

# class XenoCantoAPIResponse(TypedDict):
#     numRecordings: int
#     numSpecies: int
#     page: int
#     numPages: int
#     recordings: Sequence[XenoCantoAPIRecording]
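
The column names in the scraped metadata contain hyphens (`bird-seen`, `animal-seen`, `playback-used`), which cannot be declared as attributes in the class-based `TypedDict` syntax. A possible workaround is the functional syntax, sketched here for those three keys only. The `str` value type is an assumption based on the `unknown` values seen in the data, and the name `RecordingFlags` is made up for this example.

```python
from typing import TypedDict

# Functional syntax: keys are plain strings, so hyphens are allowed.
RecordingFlags = TypedDict(
    "RecordingFlags",
    {"bird-seen": str, "animal-seen": str, "playback-used": str},
    total=False,  # every key is optional, as in the commented-out draft above
)

example: RecordingFlags = {"bird-seen": "unknown"}
```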

In [5]:
# Once all the metadata has been retrieved, merge all pages within a species
# and store the metadata of all recordings in a single dataframe
def merge_pages(name: str) -> list[JSON]:
    """Return the recordings from every downloaded JSON page of one species."""
    # The pages of a species live in a folder named after the species, without spaces
    files = sorted((Path("./dataset/metadata/") / name.replace(" ", "")).glob("*.json"))
    recordings: list[JSON] = []
    for file in files:
        with open(file, "r") as f:
            page = json.load(f)
        recordings.extend(page["recordings"])
    return recordings

records = list(itertools.chain.from_iterable([merge_pages(name) for name in species]))
meta_dataframe = pd.DataFrame(records).astype({"id": int}).sort_values(by="id")
meta_dataframe.to_csv("./dataset/metadata.csv", index=False)
meta_dataframe.head(5)

(48450, 38)


| | id | gen | sp | ssp | group | en | rec | cnt | loc | lat | ... | rmk | bird-seen | animal-seen | playback-used | temp | regnr | auto | dvc | mic | smp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2302 | 1135 | Nycticorax | nycticorax | | birds | Black-crowned Night Heron | Don Jones | United States | Jakes Landing Road, Cape May County, New Jersey | 39.192751 | ... | | unknown | unknown | unknown | | | no | | | 22050 |
| 39012 | 2778 | Ardea | alba | | birds | Great Egret | Sjoerd Mayer | Bolivia | Close to Trinidad, along road to San Javier, Beni | -14.8001 | ... | At the roost. cd:http://www.birdsongs.com/Boli... | unknown | unknown | unknown | | | no | | | 44100 |
| 1538 | 2797 | Nycticorax | nycticorax | | birds | Black-crowned Night Heron | Sjoerd Mayer | Bolivia | Laguna Alalay, Cochabamba | -17.4084 | ... | cd:http://www.birdsongs.com/Bolivia/main.htm | unknown | unknown | unknown | | | no | | | 44100 |
| 23497 | 4415 | Hirundo | rustica | | birds | Barn Swallow | Glauco Alves Pereira | Brazil | Engenho Santa Fé, Nazaré da Mata, Pernambuco | -7.731915 | ... | small group landed in an electric thread | unknown | unknown | unknown | | | no | | | 22050 |
| 5803 | 5954 | Passer | domesticus | | birds | House Sparrow | Manuel Grosselet | Mexico | san Augustin Etla | | ... | | unknown | unknown | unknown | | | no | | | 44100 |


In [18]:
# Match the metadata with the recordings in our dataset
ids = pd.read_csv(TRAIN_METADATA_PATH)["url"].str.split("/").str.get(-1).astype(int).to_list()
metadata = pd.read_csv("./dataset/metadata.csv")

train_metadata_xc = metadata[metadata["id"].isin(ids)]
train_metadata_xc.to_csv("./dataset/train_metadata_xc.csv", index=False)
train_metadata_xc.shape


(23898, 38)

In [24]:
# Check for missing ids: some recordings have been removed from Xeno-Canto, so they do not appear in the scraped metadata
missing_ids = set(ids) - set(train_metadata_xc["id"])
len(missing_ids)

542
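
To see which training rows lack metadata, rather than just how many, the same id extraction can be combined with a negated `isin` filter. The sketch below uses toy data: the urls, ids and the names `train`, `scraped` and `missing_rows` are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Toy stand-in for the training metadata; the recording id is the last url segment.
train = pd.DataFrame(
    {
        "url": [
            "https://xeno-canto.org/1135",
            "https://xeno-canto.org/2778",
            "https://xeno-canto.org/9999",
        ]
    }
)
train["id"] = train["url"].str.split("/").str.get(-1).astype(int)

# Toy stand-in for the scraped Xeno Canto metadata.
scraped = pd.DataFrame({"id": [1135, 2778, 4415]})

# Rows of the training metadata whose id was not found in the scraped metadata.
missing_rows = train[~train["id"].isin(scraped["id"])]
```

Keeping the full rows makes it possible to check whether the missing recordings are concentrated in a few species.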