### Download CoralNet (Notebook)

This notebook is useful for experimenting with the functions in
`Download_CoralNet.py`. The script itself is designed to be run from the
command line: given a list of source IDs, it downloads the images,
annotations, labelset, and model metadata for each source.

#### Import packages

In [1]:
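import os
import sys

# Make the repository root importable, then use an absolute import;
# a relative import ("from ..Tools") fails in a notebook with no parent package
sys.path.append("../")
from Tools.Download import *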

#### Set up authentication

The first step is to authenticate with CoralNet using your username and
password. If you don't have an account, you can create one at
https://coralnet.ucsd.edu/. To avoid typing your credentials every time you
run the script, store them in a separate file or set them as the environment
variables `CORALNET_USERNAME` and `CORALNET_PASSWORD`, which the cell below
reads. If you would rather not store your credentials anywhere, you can also
provide them as arguments when you run the script.

In [2]:
# Username
CORALNET_USERNAME = os.getenv("CORALNET_USERNAME")
USERNAME = input("Username: ") if not CORALNET_USERNAME else CORALNET_USERNAME

# Password
CORALNET_PASSWORD = os.getenv("CORALNET_PASSWORD")
PASSWORD = input("Password: ") if not CORALNET_PASSWORD else CORALNET_PASSWORD

try:
    # Authenticate
    authenticate(USERNAME, PASSWORD)
except Exception as e:
    print(e)


###############################################
Authentication
###############################################

NOTE: Authenticating user jordan.pierce@noaa.gov
NOTE: Authentication successful for jordan.pierce@noaa.gov
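
The environment-variable fallback used above can also be wrapped in a small helper that hides the password while it is typed. This is a minimal sketch, not part of the repository: `resolve_credential` is a hypothetical name, and it relies only on the standard-library `getpass` module.

```python
import os
from getpass import getpass


def resolve_credential(env_name, prompt, secret=False, environ=os.environ):
    """Return the value of env_name, or prompt the user when it is unset or empty.

    `resolve_credential` is a hypothetical helper, not a function from Tools.Download.
    """
    value = environ.get(env_name)
    if value:
        return value
    # getpass() hides the typed characters; input() echoes them
    return getpass(prompt) if secret else input(prompt)


# Possible usage in the cell above (commented out so nothing prompts on import):
# USERNAME = resolve_credential("CORALNET_USERNAME", "Username: ")
# PASSWORD = resolve_credential("CORALNET_PASSWORD", "Password: ", secret=True)
```

Passing `environ` explicitly makes the helper easy to try out with a plain dictionary instead of the real environment.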


#### Set up the output directory

This will be the root directory where all data for all sources will be saved.
Each source will be a sub-folder in this directory.

In [3]:
# Set the path to the root directory where you want to save the data for
# each source. The data will be saved in a subdirectory named after the source.
OUTPUT_DIR = "../CoralNet_Data/"

#### Download data from a single source

In this cell, we specify the source of interest by its ID and create a sub-folder to hold all
the downloaded data for it.

In [4]:
# The ID of the source to download data from
SOURCE_ID = 4085

# Where to store the downloaded data (portable path; the trailing "" keeps the final separator)
SOURCE_DIR = os.path.join(os.path.abspath(OUTPUT_DIR), str(SOURCE_ID), "")

# Create the directory if it doesn't exist
os.makedirs(SOURCE_DIR, exist_ok=True)

#### Creating a driver (i.e., browser)

The CoralNet website is built using JavaScript, so we need a browser driver to
interact with it. The driver is used to log in to CoralNet, navigate to the
source of interest, and download the data. The driver can be Chrome, Firefox,
or Edge.

We can also make the driver "headless", which means that the browser window is not visible
while the script is running. This is useful if you want to run the script in the background.

Finally, we set the browser's download directory to the source directory we created above.

In [5]:
# If True, browser will operate in the background
headless = True

# Create the driver, passing the headless flag
driver = check_for_browsers(headless)

# Store the credentials in the driver
driver.capabilities['credentials'] = {
    "username": USERNAME,
    "password": PASSWORD
}

# Set the download directory (execute_cdp_cmd requires a Chromium-based driver such as Chrome or Edge)
download_settings = {
        "behavior": "allow",
        "downloadPath": SOURCE_DIR
}

driver.execute_cdp_cmd('Page.setDownloadBehavior', download_settings);


###############################################
Browser
###############################################

NOTE: Using Google Chrome


{}

#### Log in to CoralNet

Next, we log in to CoralNet. The login function takes the driver, which already holds the
username and password stored above, and returns a driver that is logged in to CoralNet.

In [6]:
# Log in to CoralNet
driver, _ = login(driver)


###############################################
Login
###############################################

NOTE: Successfully logged in for jordan.pierce@noaa.gov


#### Get the images for the source

The next step is to get the images for the source. `get_images` takes the driver and the ID
of the source and crawls all of the pages for that source, extracting the image names and image
page URLs. `get_image_urls` then uses the image page URLs to get the image URLs that are
currently hosted on AWS.

With this dataframe, we can then download all of the images from AWS to the specified directory.

In [11]:
# Get the images for the source
driver, IMAGES = get_images(driver, SOURCE_ID)

if IMAGES is not None:
    # Get the image page URLs
    image_pages = IMAGES['Image Page'].tolist()
    # Get the image AWS URLs
    driver, IMAGES['Image URL'] = get_image_urls(driver, image_pages)

    # Download the images to the specified directory
    download_images(IMAGES, SOURCE_DIR)

# Display a few rows of the image dataframe
IMAGES.sample(3)


NOTE: Crawling all pages for source 4085


  0%|          | 0/1 [00:00<?, ?it/s]


NOTE: Finished crawling all pages
NOTE: Retrieving image URLs



100%|██████████| 6/6 [00:00<00:00,  8.05it/s]


NOTE: Retrieved 6 image URLs
NOTE: Saved image dataframe as CSV file

NOTE: Downloading 6 images


100%|██████████| 6/6 [00:01<00:00,  3.46it/s]


Unnamed: 0,Image Page,Name,Image URL
3,https://coralnet.ucsd.edu/image/3395583/view/,Screenshot (113).png,https://coralnet-production.s3.amazonaws.com:4...
2,https://coralnet.ucsd.edu/image/3395582/view/,Screenshot (112).png,https://coralnet-production.s3.amazonaws.com:4...
0,https://coralnet.ucsd.edu/image/3395580/view/,0-5.png,https://coralnet-production.s3.amazonaws.com:4...
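
Once the image names and URLs are in a dataframe, the download step amounts to fetching each URL and writing the bytes to disk. The sketch below illustrates that idea with the standard library only. It is not the implementation of `download_images`; `download_missing` and `fetch_bytes` are hypothetical helpers, and skipping files that already exist is an assumption added so that an interrupted run can be resumed.

```python
import os
from urllib.request import urlopen


def fetch_bytes(url, timeout=30):
    """Default fetcher: read the whole response body over HTTP(S)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_missing(rows, out_dir, fetch=fetch_bytes):
    """Save each (name, url) pair under out_dir, skipping files already present.

    Hypothetical helper; returns the list of paths written during this call.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for name, url in rows:
        # basename() guards against names that contain directory components
        path = os.path.join(out_dir, os.path.basename(name))
        if os.path.exists(path):
            continue
        data = fetch(url)
        with open(path, "wb") as handle:
            handle.write(data)
        written.append(path)
    return written
```

With the dataframe above, the pairs could be built as `zip(IMAGES['Name'], IMAGES['Image URL'])`. The `fetch` parameter exists so the function can be exercised without network access.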


#### Get the labelset for the source

This next cell gets the labelset for the source. The function takes the driver, the ID of the
source, and the source directory, downloads the labelset as a CSV file, reads that file into a
dataframe, and returns it.

In [12]:
# Download the labelset for the source
driver, LABELSET = download_labelset(driver, SOURCE_ID, SOURCE_DIR)

# Display a few rows of the labelset
LABELSET.sample(3)


NOTE: Downloading labelset for 4085
NOTE: Labelset saved successfully


Unnamed: 0,Label ID,Short Code
3,4584,_OMOL
7,102,SP
2,3438,HS


#### Get the metadata for the source

This cell gets the metadata for the source. The function takes the driver, the ID of the
source, and the source directory, and returns all the model metadata associated with the
source. If there is no model, the function returns None.

In [15]:
# Download the metadata for the source
driver, META = download_metadata(driver, SOURCE_ID, SOURCE_DIR)

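if META is not None:
    # Display some metadata; display() is needed because a bare expression
    # inside an if block is not shown automatically
    display(META.sample(1))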


NOTE: Downloading model metadata for 4085
ERROR: Issue with downloading metadata


#### Get the annotations for the source

This cell downloads all the annotations in the source, selecting every available export
option, including machine-suggested annotations. Unfortunately, CoralNet takes quite a bit of
time to serve the annotations, so this cell can take a while to run. The function accepts a
`wait_time` parameter that sets how long to wait for the annotations to load (the default is
3600 seconds).

In [16]:
# Download the annotations for the source
driver, ANNOTATIONS = download_annotations(driver, SOURCE_ID, SOURCE_DIR)

# Display some annotations
ANNOTATIONS.sample(5)


NOTE: Downloading annotations for source 4085
NOTE: Annotations saved successfully


Unnamed: 0,Name,Date,Aux1,Aux2,Aux3,Aux4,Aux5,Height (cm),Latitude,Longitude,...,Machine suggestion 1,Machine confidence 1,Machine suggestion 2,Machine confidence 2,Machine suggestion 3,Machine confidence 3,Machine suggestion 4,Machine confidence 4,Machine suggestion 5,Machine confidence 5
111,Screenshot (112).png,,,,,,,,,,...,,,,,,,,,,
234,Screenshot (115).png,,,,,,,,,,...,,,,,,,,,,
3,0-5.png,,,,,,,,,,...,,,,,,,,,,
72,Screenshot (111).png,,,,,,,,,,...,,,,,,,,,,
168,Screenshot (114).png,,,,,,,,,,...,,,,,,,,,,
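
Waiting up to `wait_time` seconds for a slow export can be pictured as a polling loop that checks the download directory until the file appears or the deadline passes. The following is a rough sketch under stated assumptions, not the logic inside `download_annotations`: `wait_for_download` is a hypothetical helper, and it assumes a Chrome-style browser that writes in-progress downloads to a sibling file ending in `.crdownload`.

```python
import os
import time


def wait_for_download(directory, filename, wait_time=3600, poll_interval=1.0):
    """Poll until `filename` is fully downloaded into `directory`.

    Hypothetical helper. Returns the file path on success, or None on timeout.
    """
    target = os.path.join(directory, filename)
    # Assumption: the browser keeps a "<name>.crdownload" file while downloading
    partial = target + ".crdownload"
    deadline = time.monotonic() + wait_time
    while True:
        if os.path.exists(target) and not os.path.exists(partial):
            return target
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)
```

Using `time.monotonic()` for the deadline keeps the timeout correct even if the system clock changes during a long wait.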


#### Downloading a list of CoralNet Sources

These next few cells can be used to help download information from CoralNet that would help in
identifying potential sources of interest. For example, this first cell will download a list of
all the public sources on CoralNet, including ID, name, and page URL.

In [9]:
# Download the list of sources on CoralNet
driver, SOURCES = download_coralnet_sources(driver, OUTPUT_DIR)

# Display some sources
SOURCES.sample(5)


###############################################
CoralNet Source Dataframe
###############################################

NOTE: Downloading CoralNet Source Dataframe


100%|██████████| 689/689 [00:00<00:00, 36072.49it/s]

NOTE: CoralNet Source Dataframe saved successfully





Unnamed: 0,Source_ID,Source_Name,Source_URL
292,1270,Josiah Johnson,https://coralnet.ucsd.edu/source/1270/
624,2307,Therese_East,https://coralnet.ucsd.edu/source/2307/
0,4085,0-5meters,https://coralnet.ucsd.edu/source/4085/
66,3818,BelizeBarrierL1,https://coralnet.ucsd.edu/source/3818/
68,2614,Benthic foramifera associated with seagrass H....,https://coralnet.ucsd.edu/source/2614/


#### Downloading a list of CoralNet Labelsets

In addition to the sources, we can also download all the public labelsets on CoralNet, along with
all the additional attributes associated with each. This includes the labelset name, ID, page
URL, functional group, short code, and popularity.

In [10]:
# Download the list of labelsets on CoralNet
driver, LABELSETS = download_coralnet_labelsets(driver, OUTPUT_DIR)

# Display some labelsets
LABELSETS.sample(5)


###############################################
CoralNet Labelset Dataframe
###############################################

NOTE: Downloading CoralNet Labelset Dataframe


100%|██████████| 7547/7547 [00:00<00:00, 10963.16it/s]


NOTE: Labelset Dataframe saved successfully


Unnamed: 0,Label ID,Name,URL,Functional Group,Popularity %,Short Code,Duplicate,Duplicate Notes,Verified,Has Calcification Rates
962,1558,Echinophyllia aspera,https://coralnet.ucsd.edu/label/1558/,Hard coral,56,EcAsp,False,,False,False
3452,3041,Fan Worm,https://coralnet.ucsd.edu/label/3041/,Other Invertebrates,33,WORM fan,False,,False,False
477,762,Bleached Platygyra ryukyuensis,https://coralnet.ucsd.edu/label/762/,Hard coral,61,B_Pryu,False,,False,False
379,3061,Bleached Leptastrea,https://coralnet.ucsd.edu/label/3061/,Hard coral,0,B_Leptastr,False,,False,False
2083,5100,Siderastrea_encrusting,https://coralnet.ucsd.edu/label/5100/,Hard coral,0,sidenc,False,,False,False


#### Downloading a list of CoralNet Sources that contain a specific labelset

Finally, you can find the sources that contain at least one example of a labelset of
interest. In this example, we're interested in Acropora (branching) corals: we look up the
labelset by name in the labelset dataframe, and then pass the matching row to
`get_sources_with` to find all the sources that contain at least one example of that labelset.

In return, we get a dataframe of sources, for which we can then download the images, labelset,
metadata, and annotations.

In [18]:
# Select the labelset of interest by its exact name
DESIRED_LABELSET = LABELSETS[LABELSETS['Name'] == "Acropora (branching)"]

# Filter the sources to only those that have the desired labelset
driver, SOURCE_LIST = get_sources_with(driver, DESIRED_LABELSET, OUTPUT_DIR)

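if SOURCE_LIST is not None:
    # Display some sources; display() is needed because a bare expression
    # inside an if block is not shown automatically
    display(SOURCE_LIST.sample(5))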

NOTE: Downloading dataframe of sources


1it [00:02,  2.03s/it]

ERROR: Unable to get dataframe of Source IDs
ERROR: Could not save Source ID dataframe
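
The lookup above uses an exact `==` match on the name, which selects nothing if the name differs in case or surrounding whitespace. A more tolerant lookup could be sketched as follows, using only the standard library on CSV text. `find_label_ids` is a hypothetical helper, and the sample rows are copied from the labelset output shown earlier, reduced to three columns.

```python
import csv
import io


def find_label_ids(csv_text, name):
    """Return the 'Label ID' of every row whose 'Name' matches `name`,
    ignoring case and surrounding whitespace. Hypothetical helper."""
    wanted = name.strip().casefold()
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["Label ID"] for row in reader
            if row["Name"].strip().casefold() == wanted]


# Rows taken from the labelset dataframe output above
SAMPLE = """Label ID,Name,Functional Group
1558,Echinophyllia aspera,Hard coral
3041,Fan Worm,Other Invertebrates
3061,Bleached Leptastrea,Hard coral
"""

print(find_label_ids(SAMPLE, "  fan worm "))  # prints ['3041']
```

An empty result is a useful signal to check the spelling of the name before asking CoralNet for matching sources.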





#### Download all the data

Now that we have a list of sources that contain the labelset of interest, we can download all the
data for each source: the images, labelset, metadata, and annotations. This can take a while,
so it's best to start this cell and then go do something else.

In theory, it's possible to download all the public data from CoralNet; just be aware that the
annotations take a long time to download (this is a limitation of CoralNet, not of this script).

In [16]:
# Loop through each of the source ids, and download the data
# Here we just do the first source, but you can do all of them.
for source_id in SOURCE_LIST['Source_ID'].tolist()[0:1]:
    # Download the data for the source
    download_data(driver, source_id, OUTPUT_DIR)

NOTE: Downloading model metadata for 1653
ERROR: Issue with downloading metadata
ERROR: Unable to get model metadata from source 1653
ERROR: Source 1653 may not have a trained model
NOTE: Downloading labelset for 1653
NOTE: Labelset saved successfully
NOTE: Crawling all pages for source 1653


 50%|█████     | 1/2 [00:01<00:01,  1.69s/it]


NOTE: Finished crawling all pages
NOTE: Retrieving image URLs


100%|██████████| 30/30 [00:05<00:00,  5.83it/s]


NOTE: Retrieved 30 image URLs
NOTE: Saved image dataframe as CSV file
NOTE: Downloading 30 images


100%|██████████| 30/30 [00:16<00:00,  1.79it/s]


NOTE: Downloading annotations for source 1653
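
When looping over many sources, a failure in one source does not have to stop the rest of the batch. One possible pattern, sketched here with a hypothetical `download_many` helper that is not part of the repository, is to catch errors per source and report them at the end.

```python
def download_many(source_ids, download_one):
    """Call download_one(source_id) for each ID, collecting failures.

    Hypothetical helper. Returns (succeeded, failed), where `failed`
    maps each failing source ID to its error message.
    """
    succeeded, failed = [], {}
    for source_id in source_ids:
        try:
            download_one(source_id)
            succeeded.append(source_id)
        except Exception as error:
            # Record the problem and move on to the next source
            failed[source_id] = str(error)
    return succeeded, failed
```

In this notebook, the callback could be `lambda source_id: download_data(driver, source_id, OUTPUT_DIR)`, with `SOURCE_LIST['Source_ID'].tolist()` as the list of IDs.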


#### Closing the browser

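Last but not least, we close the browser and end the driver session.

In [20]:
# quit() closes every window and ends the session; close() only closes the current window
driver.quit()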
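
To make sure the browser is shut down even when a step raises an error, the cleanup could be wrapped in a context manager. This is a minimal sketch that assumes a Selenium-style driver object with a `quit()` method; `managed_driver` and `make_driver` are hypothetical names.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Yield the driver returned by make_driver() and always call quit() on exit."""
    driver = make_driver()
    try:
        yield driver
    finally:
        # Runs whether the with-block finished normally or raised
        driver.quit()
```

In this notebook, it might be used as `with managed_driver(lambda: check_for_browsers(headless)) as driver:` followed by the login and download steps.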