# **.wav Files and Metadata Scraper**

The following code downloads the audio files in the **'Best of' cuts** section of the [Watkins Marine Mammal Sound Database](https://whoicf2.whoi.edu/science/B/whalesounds/index.cfm), along with their metadata. The 'Best of' cuts section contains 1,694 sound cuts from 32 different species, selected for their higher sound quality and lower noise.

In [1]:
# Importing the drive module from google.colab library
from google.colab import drive

# Mounting the Google Drive to the Colab environment
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Importing necessary libraries
from bs4 import BeautifulSoup
import urllib.request
import re
import requests
import json
import os

project_path = '/content/drive/My Drive/GitHub/MarineMammalSoundClassification/'

### **Species List**

The following code builds a list of all available mammals, with each one's common name, scientific name, and database code. The resulting list is stored in the file `/metadata/species.json` for convenient future access.

In [3]:
# URL of the Watkins Marine Mammal Sound Dataset 'BEST OF' CUTS page to scrape
best_of_link = 'https://whoicf2.whoi.edu/science/B/whalesounds/index.cfm'

# Open the URL and read its content
url = urllib.request.urlopen(best_of_link)
content = url.read()

# Create a BeautifulSoup object to parse the HTML content
soup = BeautifulSoup(content, 'html.parser')

# Dictionary to store species information
species = {}

# Find all div elements with class 'large-3 columns'
div_list = soup.find_all('div', class_='large-3 columns')

for sdiv in div_list:
    try:
        common_name = sdiv.a.div.h3.get_text()
        # Generate a folder name from the common name
        folder = common_name.replace(" ", "").replace(",", "_").replace("-", "_")
        scientific_name = sdiv.a.div.img['src'].split('/')[-1].replace(".png", "").replace("-", " ")
        code = sdiv.a['href'].split('code=')[-1]
        # Store species information in the dictionary
        species[folder] = {
            'common_name': common_name,
            'scientific_name': scientific_name,
            'code': code
        }
    except Exception:
        # Skip any div that is not in the expected format and continue to the next iteration
        continue

# Convert the dictionary to JSON format
species_json = json.dumps(species, indent=4)

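# Write the JSON data to a file, creating the metadata folder if it doesn't exist
os.makedirs(os.path.join(project_path, 'metadata'), exist_ok=True)
with open(os.path.join(project_path, 'metadata/species.json'), 'w') as fp:
    fp.write(species_json)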
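
The string handling in the loop above can be tried in isolation. The following is a minimal sketch, not part of the original notebook: the helper names `folder_name` and `parse_species_fields` are hypothetical, the common names are inferred from the folder names printed later in this notebook, and the `href` and image path are made-up examples of the shape the scraper expects.

```python
def folder_name(common_name):
    """Mirror the replace chain above: drop spaces, map ',' and '-' to '_'."""
    return common_name.replace(" ", "").replace(",", "_").replace("-", "_")


def parse_species_fields(href, img_src):
    """Derive the scientific name from the image file name and the code from the link."""
    scientific_name = img_src.split("/")[-1].replace(".png", "").replace("-", " ")
    code = href.split("code=")[-1]
    return scientific_name, code


# Assumed common names (inferred from the folder names in the download log)
for name in ["Killer Whale", "Fin, Finback Whale", "Long-Finned Pilot Whale"]:
    print(name, "->", folder_name(name))

# Hypothetical link and image path, for illustration only
print(parse_species_fields("bestOf.cfm?code=XX1A", "images/Orcinus-orca.png"))
```

Note that apostrophes and parentheses are left untouched by this scheme, which is why folder names such as `Fraser'sDolphin` appear in the output further down.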

### **Download .wav files**


The function `download_species_wav_files` downloads all available .wav files for each species in `species.json`. Each species' code is used to construct the link to its 'Best of' page.

In [4]:
def download_species_wav_files(link, output_path):
    """
    Download audio files (wav, mp3, ogg, wma) linked on a webpage.

    Args:
        link (str): The URL of the webpage containing the audio links.
        output_path (str): The directory where the downloaded files will be saved.

    Returns:
        None
    """
    # Open the URL and read its content
    url = urllib.request.urlopen(link)
    content = url.read()

    # Create a BeautifulSoup object to parse the HTML content
    soup = BeautifulSoup(content, 'html.parser')

    # Find all anchor elements with links to audio files
    links = [a['href'] for a in soup.find_all('a', href=re.compile(r'.*\.(mp3|wav|ogg|wma)'))]

    print("{} Audios Found".format(len(links)))

    # Headers for the request
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"
    }

    # Download each audio file
    for wav_link in links:
        # Construct the complete URL of the audio file
        u = 'https://whoicf2.whoi.edu' + wav_link
        # Define the path to save the audio file
        wav_file = os.path.join(output_path, wav_link.split("/")[-1])
        # Download the audio file and stop on HTTP errors
        response = requests.get(u, headers=headers, timeout=60)
        response.raise_for_status()
        # Save the audio file
        with open(wav_file, "wb") as f_out:
            f_out.write(response.content)
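
The link filtering and path construction inside the function can be checked without touching the network. This is a rough sketch under stated assumptions: `plan_downloads` is a hypothetical helper, the `href` values are invented examples of root-relative links, and `urljoin` is swapped in for the string concatenation used above (it gives the same result for root-relative links). The pattern is also anchored to the end of the link, which is slightly stricter than the one in the function.

```python
import os
import re
from urllib.parse import urljoin

# Same extensions as the function above, anchored to the end of the link
AUDIO_PATTERN = re.compile(r"\.(mp3|wav|ogg|wma)$", re.IGNORECASE)


def plan_downloads(hrefs, base_url, output_path):
    """Return (absolute_url, local_path) pairs for the links that point to audio files."""
    plan = []
    for href in hrefs:
        if not AUDIO_PATTERN.search(href):
            continue  # not an audio link
        url = urljoin(base_url, href)
        target = os.path.join(output_path, href.split("/")[-1])
        plan.append((url, target))
    return plan


# Hypothetical links, for illustration only
hrefs = [
    "/science/B/whalesounds/WhaleSounds/00000001.wav",
    "/science/B/whalesounds/index.cfm",
]
for url, target in plan_downloads(hrefs, "https://whoicf2.whoi.edu", "data/KillerWhale"):
    print(url, "->", target)
```

Separating the planning step from the download makes it possible to inspect the URLs before any request is sent.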


In [5]:
with open(os.path.join(project_path,'metadata/species.json')) as f:
    species = json.load(f)

# Iterate over each species in the dictionary
for key, value in species.items():
    print("Downloading files for:", key)

    # Create a folder for the mammal if it doesn't exist
    mammal_folder = os.path.join(project_path, 'data', key)
    os.makedirs(mammal_folder, exist_ok=True)

    # Construct the URL for the mammal's audio files
    mammal_link = 'https://whoicf2.whoi.edu/science/B/whalesounds/bestOf.cfm?code=' + value['code']

    # Download the audio files for the mammal and save them in the folder
    download_species_wav_files(mammal_link, mammal_folder)


Downloading files for: AtlanticSpottedDolphin
58 Audios Found
Downloading files for: BeardedSeal
37 Audios Found
Downloading files for: Beluga_WhiteWhale
50 Audios Found
Downloading files for: BottlenoseDolphin
24 Audios Found
Downloading files for: BowheadWhale
60 Audios Found
Downloading files for: ClymeneDolphin
63 Audios Found
Downloading files for: CommonDolphin
52 Audios Found
Downloading files for: FalseKillerWhale
59 Audios Found
Downloading files for: Fin_FinbackWhale
50 Audios Found
Downloading files for: Fraser'sDolphin
87 Audios Found
Downloading files for: Grampus_Risso'sDolphin
67 Audios Found
Downloading files for: HarpSeal
47 Audios Found
Downloading files for: HumpbackWhale
64 Audios Found
Downloading files for: KillerWhale
35 Audios Found
Downloading files for: LeopardSeal
10 Audios Found
Downloading files for: Long_FinnedPilotWhale
70 Audios Found
Downloading files for: MelonHeadedWhale
63 Audios Found
Downloading files for: MinkeWhale
17 Audios Found
Downloading fil
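
Once the downloads finish, the per-species counts on disk can be compared with the "Audios Found" lines above and with the 1,694 cuts mentioned in the introduction. The sketch below is one possible check, not part of the original notebook: `count_wav_files` is a hypothetical helper, and the demo runs on a temporary directory with made-up file names rather than on the real `data/` folder.

```python
import os
import tempfile


def count_wav_files(data_root):
    """Count the .wav files in each immediate subfolder of data_root."""
    counts = {}
    for entry in sorted(os.listdir(data_root)):
        folder = os.path.join(data_root, entry)
        if os.path.isdir(folder):
            counts[entry] = sum(
                1 for name in os.listdir(folder) if name.lower().endswith(".wav")
            )
    return counts


# Demo on a throwaway directory with hypothetical file names
with tempfile.TemporaryDirectory() as root:
    layout = {"KillerWhale": ["a.wav", "b.wav"], "MinkeWhale": ["c.wav", "notes.txt"]}
    for folder, files in layout.items():
        os.makedirs(os.path.join(root, folder))
        for name in files:
            open(os.path.join(root, folder, name), "wb").close()
    demo_counts = count_wav_files(root)

print(demo_counts)
print("Total:", sum(demo_counts.values()))
```

On the real data, the call would be `count_wav_files(os.path.join(project_path, 'data'))`.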

### **Get metadata for each .wav file**

The `get_wav_metadata` function scrapes the available metadata for each previously downloaded .wav file. The results are saved in the `/metadata/wav_metadata.json` file for potential future use.

In [6]:
def get_wav_metadata(wav_metadata_link):
    """
    Scrape metadata from a WAV file's webpage.

    Args:
        wav_metadata_link (str): The URL of the webpage containing the WAV file's metadata.

    Returns:
        dict: A dictionary containing the extracted metadata.
    """
    # Open the URL and read its content
    url = urllib.request.urlopen(wav_metadata_link)
    content = url.read()

    # Create a BeautifulSoup object to parse the HTML content
    soup = BeautifulSoup(content, 'html.parser')

    # Dictionary to store metadata
    metadata = {}

    # Find all tables on the webpage
    tables = soup.find_all('table')

    # The metadata table is the second table on the page
    meta_table = tables[1]

    # Iterate over each row in the metadata table
    for row in meta_table.find_all('tr'):
        cells = row.find_all('td')
        # Check if the row contains two cells (key and value)
        if len(cells) == 2:
            # Add the key-value pair to the metadata dictionary
            metadata[cells[0].text] = cells[1].text

    return metadata
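
Because the function stores the raw cell text, keys and values may carry stray whitespace or trailing colons, depending on how the page is laid out. The following is a small sketch of an optional clean-up step, assuming that is the case: `clean_metadata` is a hypothetical helper and the sample dictionary is invented for illustration, not taken from the database.

```python
def clean_metadata(raw):
    """Collapse whitespace in keys and values and drop a trailing colon from keys."""
    cleaned = {}
    for key, value in raw.items():
        key = " ".join(key.split()).rstrip(": ")
        value = " ".join(value.split())
        if key:  # ignore rows whose key cell is empty
            cleaned[key] = value
    return cleaned


# Hypothetical raw output, for illustration only
sample = {
    "\n Genus: ": " Orcinus\n",
    "Location:": "  Dabob Bay,\n Washington ",
    "  ": "ignored",
}
print(clean_metadata(sample))
```

It could be applied to the dictionary returned by `get_wav_metadata` before the result is stored.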


In [7]:
# Initialize an empty dictionary to store metadata
metadata = {}

with open(os.path.join(project_path,'metadata/species.json')) as f:
    species = json.load(f)

# Iterate over each species in the dictionary
for key, value in species.items():
    print("Creating metadata for:", key)

    # Construct the path to the folder containing WAV files for the current species
    mammal_folder = os.path.join(project_path, 'data/', key)

    # Get the list of WAV files in the species folder
    list_of_wav = os.listdir(mammal_folder)

    for wav_file in list_of_wav:
        # Extract the WAV code from the file name
        wav_code = wav_file[:-4]

        # Construct the URL for the WAV file's metadata
        wav_metadata_link = 'https://whoicf2.whoi.edu/science/B/whalesounds/metaData.cfm?RN=' + wav_code

        # Get metadata for the WAV file
        wav_meta = get_wav_metadata(wav_metadata_link)

        # Add the metadata to the metadata dictionary using the WAV code as the key
        metadata[wav_code] = wav_meta

# Convert the metadata dictionary to JSON format
metadata_json = json.dumps(metadata, indent=4)

# Write the JSON data to a file
with open(os.path.join(project_path,'metadata/wav_metadata.json'), 'w') as fp:
    fp.write(metadata_json)

Creating metadata for: AtlanticSpottedDolphin
Creating metadata for: BeardedSeal
Creating metadata for: Beluga_WhiteWhale
Creating metadata for: BottlenoseDolphin
Creating metadata for: BowheadWhale
Creating metadata for: ClymeneDolphin
Creating metadata for: CommonDolphin
Creating metadata for: FalseKillerWhale
Creating metadata for: Fin_FinbackWhale
Creating metadata for: Fraser'sDolphin
Creating metadata for: Grampus_Risso'sDolphin
Creating metadata for: HarpSeal
Creating metadata for: HumpbackWhale
Creating metadata for: KillerWhale
Creating metadata for: LeopardSeal
Creating metadata for: Long_FinnedPilotWhale
Creating metadata for: MelonHeadedWhale
Creating metadata for: MinkeWhale
Creating metadata for: Narwhal
Creating metadata for: NorthernRightWhale
Creating metadata for: PantropicalSpottedDolphin
Creating metadata for: RossSeal
Creating metadata for: Rough_ToothedDolphin
Creating metadata for: Short_Finned(Pacific)PilotWhale
Creating metadata for: SouthernRightWhale
Creating
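
The loop above derives each record code by cutting the last four characters off the file name, which assumes every entry in the folder ends in `.wav`. A slightly more defensive variant is sketched here: `wav_codes` and `metadata_url` are hypothetical helpers, and the file names are made up for illustration. Only the URL prefix comes from the cell above.

```python
import os

METADATA_URL = "https://whoicf2.whoi.edu/science/B/whalesounds/metaData.cfm?RN="


def wav_codes(filenames):
    """Return the record codes of the .wav files, skipping anything else in the folder."""
    codes = []
    for name in filenames:
        stem, ext = os.path.splitext(name)
        if ext.lower() == ".wav":
            codes.append(stem)
    return codes


def metadata_url(wav_code):
    """Build the metadata page URL for one record code."""
    return METADATA_URL + wav_code


# Hypothetical folder listing, for illustration only
listing = ["00000001.wav", "0000002Q.WAV", ".DS_Store", "notes.txt"]
for code in wav_codes(listing):
    print(code, "->", metadata_url(code))
```

This keeps hidden files or partial downloads from producing requests for metadata pages that do not exist.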