### Download CoralNet (Notebook)

This notebook is useful for experimenting with the functions in
`CoralNet_Download.py`. The script is designed to be run from the command line
and lets a user download the images, annotations, labelset, and model metadata
for each source in a list of source IDs.

#### Import packages

In [1]:
from CoralNet_Download import *

#### Set up authentication

The first step is to authenticate with CoralNet by providing your username and
password. If you don't have an account, you can create one at
https://coralnet.ucsd.edu/. To avoid entering your credentials every time you
run the script, you can store them in a separate file or set them as user or
system environment variables. If you'd rather not store them, you can also pass
them as arguments when you run the script. The cell below reads the
`CORALNET_USERNAME` and `CORALNET_PASSWORD` environment variables and prompts
for any value that is not set.

In [2]:
# Username
CORALNET_USERNAME = os.getenv("CORALNET_USERNAME")
USERNAME = input("Username: ") if not CORALNET_USERNAME else CORALNET_USERNAME

# Password
CORALNET_PASSWORD = os.getenv("CORALNET_PASSWORD")
PASSWORD = input("Password: ") if not CORALNET_PASSWORD else CORALNET_PASSWORD

try:
    # Authenticate
    authenticate(USERNAME, PASSWORD)
except Exception as e:
    print(e)

NOTE: Authentication successful for jordan.pierce@noaa.gov
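
The cell above prompts with `input()`, which echoes the password as it is typed. As a minimal sketch of an alternative, the hypothetical helper below (`read_credentials` is not part of `CoralNet_Download`) reads the same two environment variables and falls back to `getpass` for the password so that it is not shown on screen.

```python
import getpass
import os


def read_credentials(env=os.environ, ask_user=input, ask_password=getpass.getpass):
    """Return (username, password), preferring environment variables."""
    # Fall back to a prompt only when the variable is unset or empty
    username = env.get("CORALNET_USERNAME") or ask_user("Username: ")
    # getpass hides the typed password instead of echoing it
    password = env.get("CORALNET_PASSWORD") or ask_password("Password: ")
    return username, password
```

Calling `USERNAME, PASSWORD = read_credentials()` would then replace the two prompts above. The `env`, `ask_user`, and `ask_password` parameters exist only to make the helper easy to test.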


#### Set up the output directory

This will be the root directory where all data for all sources will be saved.
Each source will be a sub-folder in this directory.

In [3]:
# Set the path to the root directory where you want to save the data for
# each source. The data will be saved in a subdirectory named after the source.
OUTPUT_DIR = "../CoralNet_Data/"

#### Download data from a single source

In this cell, we specify the source of interest by its ID and create a sub-folder to hold all
of its downloaded data.

In [4]:
# The ID of the source to download data from
SOURCE_ID = 4060

# Where to store the downloaded data; os.path.join and os.sep keep the
# path valid on Windows, macOS, and Linux
SOURCE_DIR = os.path.join(os.path.abspath(OUTPUT_DIR), str(SOURCE_ID)) + os.sep

# Create the directory if it doesn't exist
os.makedirs(SOURCE_DIR, exist_ok=True)
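
As an alternative to assembling the path by hand, `pathlib` can build and create the folder in one step. This is only a sketch: `source_directory` is a hypothetical helper, not a function from `CoralNet_Download`.

```python
from pathlib import Path


def source_directory(output_dir, source_id):
    """Create and return the folder that holds one source's downloads."""
    # Path's "/" operator inserts the correct separator for the current OS
    path = Path(output_dir).resolve() / str(source_id)
    path.mkdir(parents=True, exist_ok=True)
    return path
```

If a downstream function expects a string with a trailing separator, as `SOURCE_DIR` has above, `str(source_directory(OUTPUT_DIR, SOURCE_ID)) + os.sep` would produce one.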

#### Creating a driver (i.e., browser)

The CoralNet website is built using JavaScript, so we need a browser driver to
interact with it. The driver is used to log in to CoralNet, navigate to the
source of interest, and download the data. The driver can be Chrome, Firefox,
or Edge.

We can also make the driver "headless", meaning the browser window is not visible
while the script runs. This is useful if you want to run the script in the background.

Finally, we set the browser's download directory to the source directory created above.

In [5]:
options = Options()

# If True, browser will operate in the background
headless = True

if headless:
    # Set headless mode
    options.add_argument("--headless")
    options.add_argument("--disable-images")

# Pass the options object while creating the driver
driver = check_for_browsers(options=options)

# Set the download directory
download_settings = {
        "behavior": "allow",
        "downloadPath": SOURCE_DIR
}

driver.execute_cdp_cmd('Page.setDownloadBehavior', download_settings)

NOTE: Using Google Chrome


{}
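
The `Page.setDownloadBehavior` call above is a Chrome DevTools Protocol command, which Selenium exposes through `execute_cdp_cmd` on its Chromium-based drivers (Chrome and Edge). The sketch below shows one way to guard against a driver that lacks this method. `set_download_dir` is a hypothetical helper, and the sketch assumes that a driver without `execute_cdp_cmd` needs its download folder configured some other way.

```python
def set_download_dir(driver, download_dir):
    """Point a Chromium-based driver at download_dir; return True on success."""
    # execute_cdp_cmd is assumed to exist only on Chromium-based drivers
    if not hasattr(driver, "execute_cdp_cmd"):
        return False
    driver.execute_cdp_cmd(
        "Page.setDownloadBehavior",
        {"behavior": "allow", "downloadPath": str(download_dir)},
    )
    return True
```

A `False` return value signals that downloads may land in the browser's default folder rather than in `SOURCE_DIR`.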

#### Log in to CoralNet

Next, we log in to CoralNet through the browser. The `login` function takes the driver, your
username, and your password as input, and returns a driver that is logged in to CoralNet
(along with a second value that is not used here).

In [6]:
# Log in to CoralNet
driver, _ = login(driver, USERNAME, PASSWORD)

NOTE: Successfully logged in for jordan.pierce@noaa.gov


#### Get the images for the source

The next step is to get the images for the source. The `get_images` function takes the driver
and the source ID, crawls all of the pages for that source, and extracts the image names and
image page URLs. This information is then used to get the image URLs, which are currently
hosted on AWS.

With this dataframe, we can then download all of the images from AWS to the specified directory.

In [5]:
# Get the images for the source
driver, IMAGES = get_images(driver, SOURCE_ID)

if IMAGES is not None:
    # Get the image page URLs
    image_pages = IMAGES['image_page'].tolist()
    # Get the image AWS URLs
    IMAGES['image_url'] = get_image_urls(image_pages,
                                         USERNAME,
                                         PASSWORD)

    # Download the images to the specified directory
    download_images(IMAGES, SOURCE_DIR)

# Display some images
IMAGES.sample(3)

NOTE: Crawling all pages for source 4060


 50%|█████     | 1/2 [00:01<00:01,  1.43s/it]


NOTE: Finished crawling all pages
NOTE: Retrieving image URLs


100%|██████████| 30/30 [00:01<00:00, 18.20it/s]

NOTE: Retrieved 30 image URLs





| | image_page | image_name | image_url |
|---|---|---|---|
| 10 | https://coralnet.ucsd.edu/image/3390411/view/ | mcr_lter1_fringingreef_pole2-3_qu3_20080415.jpg | https://coralnet-production.s3.amazonaws.com:4... |
| 15 | https://coralnet.ucsd.edu/image/3392449/view/ | mcr_lter1_fringingreef_pole2-3_qu8_20080415.jpg | https://coralnet-production.s3.amazonaws.com:4... |
| 13 | https://coralnet.ucsd.edu/image/3390415/view/ | mcr_lter1_fringingreef_pole2-3_qu6_20080415.jpg | https://coralnet-production.s3.amazonaws.com:4... |
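
When a download is interrupted, it can help to see which images are still missing before calling `download_images` again. We don't know whether `download_images` already skips existing files, so the sketch below is an independent illustration: `pending_downloads` is a hypothetical helper that relies only on the `image_name` and `image_url` columns shown above.

```python
import os


def pending_downloads(records, image_dir):
    """Return (url, path) pairs for images not yet saved in image_dir."""
    pending = []
    for record in records:
        path = os.path.join(image_dir, record["image_name"])
        # Skip files that already exist so a rerun only fetches what is missing
        if not os.path.exists(path):
            pending.append((record["image_url"], path))
    return pending
```

With the dataframe from this cell, the call would be `pending_downloads(IMAGES.to_dict("records"), SOURCE_DIR)`.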


#### Get the labelset for the source

This next cell gets the labelset for the source. The `download_labelset` function takes the
driver, the source ID, and the source directory, downloads the labelset as a CSV file, reads
it into a dataframe, and returns it.

In [6]:
# Download the labelset for the source
driver, LABELSET = download_labelset(driver, SOURCE_ID, SOURCE_DIR)

# Display some labels from the labelset
LABELSET.sample(3)

NOTE: Downloading labelset for 4060
NOTE: Labelset saved successfully


| | Label ID | Short Code |
|---|---|---|
| 9 | 81 | Macro |
| 8 | 66 | Lepta |
| 13 | 85 | Off |


#### Get the metadata for the source

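This cell gets the metadata for the source. The `download_metadata` function takes the driver,
the source ID, and the source directory, and returns all the model metadata associated with
the source. If the source has no model, the function returns None. In that case,
`META.sample(1)` in the cell below would fail, so check for None first when working with a
source that has no trained model.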

In [18]:
# Download the metadata for the source
driver, META = download_metadata(driver, SOURCE_ID, SOURCE_DIR)

# Display some metadata
META.sample(1)

NOTE: Downloading model metadata for 4060


100%|██████████| 1/1 [00:00<?, ?it/s]

NOTE: Metadata saved successfully





| | Classifier nbr | Accuracy % | N_Images | Train_Time | Date | Model_ID |
|---|---|---|---|---|---|---|
| 0 | 1 | 84 | 30 | 0:00:29 | 2023-05-16 | 35231 |
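
Because the metadata can be None, and because pandas' `sample` raises a `ValueError` when asked for more rows than the dataframe holds, a small guard makes the preview cells safer to rerun on other sources. The sketch below is a hypothetical helper (`preview` is not part of `CoralNet_Download`) that works on any object with `len()` and a `sample(n)` method.

```python
def preview(frame, n=3):
    """Return up to n random rows, or None when there is nothing to show."""
    if frame is None or len(frame) == 0:
        return None
    # Cap n at the row count so sample() never asks for too many rows
    return frame.sample(min(n, len(frame)))
```

For example, `preview(META, 1)` would stand in for `META.sample(1)`.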


#### Get the annotations for the source

This cell downloads all the annotations in the source with every available export option
selected, including machine-suggested annotations. Unfortunately, CoralNet takes quite a while
to serve the annotations, so this cell can take a long time to run. The function accepts a
`wait_time` argument that sets how long to wait for the annotations to load (the default is
3600 seconds).

In [8]:
# Download the annotations for the source
driver, ANNOTATIONS = download_annotations(driver, SOURCE_ID, SOURCE_DIR)

# Display some annotations
ANNOTATIONS.sample(5)

NOTE: Downloading annotations for source 4060
NOTE: Annotations saved successfully


| | Name | Date | Aux1 | Aux2 | Aux3 | Aux4 | Aux5 | Height (cm) | Latitude | Longitude | ... | Machine suggestion 1 | Machine confidence 1 | Machine suggestion 2 | Machine confidence 2 | Machine suggestion 3 | Machine confidence 3 | Machine suggestion 4 | Machine confidence 4 | Machine suggestion 5 | Machine confidence 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5825 | mcr_lter1_fringingreef_pole4-5_qu6_20080415.jpg | | | | | | | | | | ... | | | | | | | | | | |
| 5437 | mcr_lter1_fringingreef_pole4-5_qu4_20080415.jpg | | | | | | | | | | ... | | | | | | | | | | |
| 3312 | mcr_lter1_fringingreef_pole3-4_qu1_20080415.jpg | | | | | | | | | | ... | | | | | | | | | | |
| 2990 | mcr_lter1_fringingreef_pole2-3_qu7_20080415.jpg | | | | | | | | | | ... | | | | | | | | | | |
| 2553 | mcr_lter1_fringingreef_pole2-3_qu5_20080415.jpg | | | | | | | | | | ... | | | | | | | | | | |
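
To illustrate what a `wait_time` budget means in practice, the sketch below polls for a file until it appears or the time runs out. This is an assumption about the general pattern, not a description of how `download_annotations` is implemented; `wait_for_file` and `poll_interval` are hypothetical names.

```python
import os
import time


def wait_for_file(path, wait_time=3600, poll_interval=1.0):
    """Return True once path exists, or False after wait_time seconds."""
    deadline = time.monotonic() + wait_time
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        # Sleep between checks instead of spinning the CPU
        time.sleep(poll_interval)
```

Using `time.monotonic()` rather than `time.time()` keeps the timeout correct even if the system clock is adjusted while waiting.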


#### Downloading a list of CoralNet Sources

These next few cells download information from CoralNet that helps identify potential sources
of interest. For example, this first cell downloads a list of all the public sources on
CoralNet, including each source's ID, name, and page URL.

In [7]:
# Download the list of sources on CoralNet
driver, SOURCES = download_coralnet_sources(driver, OUTPUT_DIR)

# Display some sources
SOURCES.sample(5)

NOTE: Downloading CoralNet Source List


100%|██████████| 671/671 [00:00<00:00, 39525.01it/s]

NOTE: CoralNet Source list saved successfully





| | Source_ID | Source_Name | Source_URL |
|---|---|---|---|
| 146 | 3737 | CPP_ROV | https://coralnet.ucsd.edu/source/3737/ |
| 28 | 3719 | Adjacent Project | https://coralnet.ucsd.edu/source/3719/ |
| 375 | 2874 | Maui_Puamana | https://coralnet.ucsd.edu/source/2874/ |
| 144 | 2971 | Coral Reef Survey Imagery of James W. Porter i... | https://coralnet.ucsd.edu/source/2971/ |
| 643 | 3835 | Victor MPA | https://coralnet.ucsd.edu/source/3835/ |


#### Downloading a list of CoralNet Labelsets

In addition to the sources, we can also download all the public labelsets on CoralNet, along
with the attributes associated with each one. These include the labelset name, ID, page URL,
functional group, short code, and popularity.

In [8]:
# Download the list of labelsets on CoralNet
driver, LABELSETS = download_coralnet_labelsets(driver, OUTPUT_DIR)

# Display some labelsets
LABELSETS.sample(5)

NOTE: Downloading CoralNet Labelset List


100%|██████████| 7483/7483 [00:00<00:00, 13000.35it/s]


NOTE: Labelset list saved successfully


| | Label ID | Name | URL | Functional Group | Popularity % | Short Code | Duplicate | Duplicate Notes | Verified | Has Calcification Rates |
|---|---|---|---|---|---|---|---|---|---|---|
| 7153 | 4295 | Red Filamentous with Sediment | https://coralnet.ucsd.edu/label/4295/ | Algae | 36 | R_Fila_Sed | False | | False | False |
| 3672 | 4181 | Jellies | https://coralnet.ucsd.edu/label/4181/ | Other Invertebrates | 29 | Jellies | False | | False | False |
| 642 | 4885 | C_MS_MON | https://coralnet.ucsd.edu/label/4885/ | Hard coral | 43 | C_MS_MON | False | | False | False |
| 564 | 5278 | Branching - Non Acropora >50cm | https://coralnet.ucsd.edu/label/5278/ | Hard coral | 0 | B50 | False | | False | False |
| 4554 | 3173 | Sponges spons | https://coralnet.ucsd.edu/label/3173/ | Other Invertebrates | 0 | Spons | False | | False | False |


#### Downloading a list of CoralNet Sources that contain a specific labelset

Finally, you can find the sources that contain at least one example of a labelset of
interest. In this example, we're interested in Acropora (branching) corals: we use the
labelset dataframe to find the ID of the labelset, and then use that ID to find all the
sources that contain at least one example of it.

The result is a dataframe of sources, for which we can then download the images, labelset,
metadata, and annotations.

In [11]:
# Get the sources that contain the labelsets of interest
DESIRED_LABELSET = LABELSETS[LABELSETS['Name'] == "Acropora (branching)"]

# Filter the sources to only those that have the desired labelset
driver, SOURCE_LIST = get_sources_with(driver, DESIRED_LABELSET, OUTPUT_DIR)

# Display some sources
SOURCE_LIST.sample(5)

NOTE: Downloading list of sources


1it [00:10, 10.95s/it]

NOTE: Source ID List saved successfully





| | Source_ID | Source_Name | Source_URL | Contains |
|---|---|---|---|---|
| 23 | 3567 | Tonga_2022-08 & Samoa_2022-09 & Samoa_2019-12 ... | https://coralnet.ucsd.edu/label/3567/ | Acropora (branching) |
| 7 | 3300 | Memba MB19 | https://coralnet.ucsd.edu/label/3300/ | Acropora (branching) |
| 6 | 2384 | IvorKSA | https://coralnet.ucsd.edu/label/2384/ | Acropora (branching) |
| 22 | 3350 | TEST Louise | https://coralnet.ucsd.edu/label/3350/ | Acropora (branching) |
| 20 | 2799 | TEST | https://coralnet.ucsd.edu/label/2799/ | Acropora (branching) |


#### Download all the data

Now that we have a list of sources that contain the labelset of interest, we can download all
the data for each source: the images, labelset, metadata, and annotations. This can take a
while, so it's best to start this cell and then go do something else.

In theory, it's possible to download all the public data from CoralNet. Just be aware that the
annotations take a long time to download (this is a limitation of CoralNet, not of this script).

In [None]:
# Loop through each of the source IDs (the column is named 'Source_ID')
for source_id in SOURCE_LIST['Source_ID'].tolist():
    # Download the data for the source
    download_data(driver, source_id, OUTPUT_DIR)
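
Since a long batch can fail partway through, it may be worth isolating each source so that one error does not stop the remaining downloads. The sketch below is one way to do that. `download_all` is a hypothetical wrapper, and `download_one` stands for any callable that takes a source ID.

```python
def download_all(source_ids, download_one):
    """Call download_one on every ID and return a {source_id: error} dict."""
    failures = {}
    for source_id in source_ids:
        try:
            download_one(source_id)
        except Exception as error:
            # Record the failure and move on to the next source
            failures[source_id] = error
    return failures
```

With the names used in this notebook, the call might look like `download_all(SOURCE_LIST['Source_ID'].tolist(), lambda sid: download_data(driver, sid, OUTPUT_DIR))`, after which the returned dictionary lists the sources to retry.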

#### Closing the browser

Last but not least, we close the browser.

In [12]:
driver.close()
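
In Selenium, `close()` closes the current window, while `quit()` closes every window and ends the WebDriver session. As a sketch of how to make sure the browser is shut down even when a download raises an error, the hypothetical context manager below (`managed_driver` is not part of `CoralNet_Download`) always calls `quit()` on exit.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(driver):
    """Yield driver and always call quit() afterwards, even on errors."""
    try:
        yield driver
    finally:
        # quit() closes every window and ends the WebDriver session
        driver.quit()
```

The download steps would then run inside `with managed_driver(driver) as driver:`.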