### CoralNet API (Notebook)

This notebook passes images and user-specified points to a CoralNet model
for prediction. Images are grouped into jobs of up to 100 images each, at
most 5 jobs are active at a time, and the status of each job is checked every
75 seconds. If a job fails to upload, its images are added to a list of
expired images and retried later with refreshed URLs. The notebook stops once
all images have been processed.

#### Import Packages

In [1]:
from CoralNet_API import *
from CoralNet_Download import *

#### Set up authentication

The first step is to authenticate with CoralNet using your username and
password. If you don't have an account, you can create one at
https://coralnet.ucsd.edu/. If you don't want to type your credentials every
time you run the notebook, you can store them in a separate file or set them
as environment variables (the cell below reads `CORALNET_USERNAME` and
`CORALNET_PASSWORD`). If a variable is not set, the cell prompts for its value.
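
The environment-variable lookup with a prompt as fallback can be sketched as a small helper. This is a minimal sketch, not part of the CoralNet scripts: the name `get_credential` and its parameters are hypothetical. It uses `getpass` for secrets so that a typed password is not echoed, which plain `input` would do.

```python
import getpass
import os


def get_credential(env_name, prompt, secret=False, environ=None, ask=None):
    """Return the value of env_name if set, otherwise prompt for it.

    `environ` and `ask` are hypothetical hooks that make the helper easy
    to test without touching the real environment or the keyboard.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(env_name)
    if value:
        return value
    if ask is None:
        # getpass hides the typed text; input echoes it
        ask = getpass.getpass if secret else input
    return ask(prompt)
```

For example, `PASSWORD = get_credential("CORALNET_PASSWORD", "Password: ", secret=True)` would only prompt when the variable is unset or empty.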

In [2]:
# Username
CORALNET_USERNAME = os.getenv("CORALNET_USERNAME")
USERNAME = input("Username: ") if not CORALNET_USERNAME else CORALNET_USERNAME

# Password
CORALNET_PASSWORD = os.getenv("CORALNET_PASSWORD")
PASSWORD = input("Password: ") if not CORALNET_PASSWORD else CORALNET_PASSWORD

try:
    # Authenticate
    authenticate(USERNAME, PASSWORD)
    CORALNET_TOKEN, HEADERS = get_token(USERNAME, PASSWORD)
except Exception as e:
    print(e)

NOTE: Successfully logged in for jordan.pierce@noaa.gov
NOTE: Successful authentication


#### Prepare the data

Next, set the `SOURCE_ID` variable to the ID of the source that contains the
model we want to use. The `get_model_meta` function returns the metadata for
that source's model, including the model ID, which we store in `MODEL_ID`.
The `get_images` function then returns a dataframe with the `image_name`,
`image_page`, and `image_url` of every image in the source. If the source
contains many images, this may take a few minutes.

In [3]:
# Desired source provided by user
SOURCE_ID = str(3420)

# Variables for the model
metadata = get_model_meta(SOURCE_ID, USERNAME, PASSWORD)
MODEL_ID = metadata['Model_ID'][0]
MODEL_URL = CORALNET_URL + f"/api/classifier/{MODEL_ID}/deploy/"

# All images associated with the source
SOURCE_IMAGES = get_images(SOURCE_ID, USERNAME, PASSWORD)

Downloading Metadata...
Crawling for Images...


In [4]:
SOURCE_IMAGES.sample(3)

Unnamed: 0,image_name,image_page,image_url
133,mcr_lter2_fringingreef_pole3-4_qu5_20080404.jpg,https://coralnet.ucsd.edu/image/2855652/view/,https://coralnet-production.s3.amazonaws.com:4...
611,mcr_lter6_out10m_pole3-4_qu5_20080406.jpg,https://coralnet.ucsd.edu/image/2853076/view/,https://coralnet-production.s3.amazonaws.com:4...
243,mcr_lter3_fringingreef_pole2-3_qu5_20080503.jpg,https://coralnet.ucsd.edu/image/2855213/view/,https://coralnet-production.s3.amazonaws.com:4...


Below we set the `DATA_ROOT` variable to the root directory in which a
subdirectory is created for each source. We also set the `SOURCE_DIR`
variable to the directory where the current source's data will be saved.
Inside it, we create one folder for the points we want to sample from each
image and one for the predictions the model returns for those points.

In [5]:
# Set the path to the root directory where you want to save the data for
# each source. The data will be saved in a subdirectory named after the source.
DATA_ROOT = "../CoralNet_Data/"

# Where the output predictions will be stored
SOURCE_DIR = DATA_ROOT + SOURCE_ID + "/"
SOURCE_POINTS = SOURCE_DIR + "points/"
SOURCE_PREDICTIONS = SOURCE_DIR + "predictions/"

# Create a folder to contain predictions and points
os.makedirs(SOURCE_DIR, exist_ok=True)
os.makedirs(SOURCE_POINTS, exist_ok=True)
os.makedirs(SOURCE_PREDICTIONS, exist_ok=True)
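
The same directory layout can also be built with `pathlib`, which avoids joining paths by string concatenation. The sketch below is an alternative, not the notebook's own code: the function name `make_source_dirs` is hypothetical, and a temporary directory stands in for `DATA_ROOT` so the example leaves nothing behind.

```python
import tempfile
from pathlib import Path


def make_source_dirs(data_root, source_id):
    """Create <data_root>/<source_id>/{points,predictions} and return the paths."""
    source_dir = Path(data_root) / str(source_id)
    dirs = {
        "source": source_dir,
        "points": source_dir / "points",
        "predictions": source_dir / "predictions",
    }
    for path in dirs.values():
        # parents=True also creates the root; exist_ok=True makes reruns safe
        path.mkdir(parents=True, exist_ok=True)
    return dirs


# Demonstration in a throwaway directory (stand-in for DATA_ROOT)
with tempfile.TemporaryDirectory() as tmp:
    dirs = make_source_dirs(tmp, "3420")
    created = sorted(p.name for p in dirs["source"].iterdir())
```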

CoralNet's API requires each image to be passed as a publicly accessible
URL. You can upload images to cloud storage and collect the URL of each
image, or you can upload the images to CoralNet (which stores them in AWS)
and then retrieve the URL of each image using the `CoralNet_Download.py`
script (the module imported above). The latter is the recommended approach.

With each image, you also need to provide the points that you want
predictions for. These points should be in a CSV file with the following
columns (lowercase, as read by the code below):
- `image_name`: The name of the image that the points are associated with
- `row`: The pixel row of the point
- `column`: The pixel column of the point

You can either provide a single CSV for all images, or a CSV for each image
(they will be concatenated together). The cell below shows an example of a
few images we want predictions for.
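
To make the expected CSV layout concrete, here is a minimal sketch that reads such a file and groups the points by image. It uses only the standard library rather than pandas, the function name `read_points` and the file contents are made up for illustration, and the lowercase `row` and `column` headers follow the column names that the notebook's code selects later.

```python
import csv
import io

REQUIRED_COLUMNS = ("image_name", "row", "column")


def read_points(csv_file):
    """Group the points of an open CSV file by image name."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"points CSV is missing columns: {missing}")
    points_by_image = {}
    for record in reader:
        point = {"row": int(record["row"]), "column": int(record["column"])}
        points_by_image.setdefault(record["image_name"], []).append(point)
    return points_by_image


# Hypothetical file contents: one CSV covering two images
example = io.StringIO(
    "image_name,row,column\n"
    "reef_01.jpg,1407,1292\n"
    "reef_01.jpg,663,96\n"
    "reef_02.jpg,415,1294\n"
)
points_by_image = read_points(example)
```

Each value in `points_by_image` has the same list-of-records shape that the prediction cell places under `"points"` in its payload.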

In [6]:
# A list of image names we want predictions for. For this example,
# we'll pretend that we have already uploaded the images to CoralNet.
SOURCE_IMAGES['image_name'].sample(10).tolist()

['mcr_lter6_out17m_pole4-5_qu4_20080406.jpg',
 'mcr_lter4_out10m_pole4-5_qu5_20080407.jpg',
 'mcr_lter1_out10m_pole1-2_qu5_20080402.jpg',
 'mcr_lter3_fringingreef_pole4-5_qu3_20080503.jpg',
 'mcr_lter2_fringingreef_pole5-6_qu5_20080404.jpg',
 'mcr_lter2_out10m_pole3-4_qu5_20080410.jpg',
 'mcr_lter4_out17m_pole4-5_qu7_20080407.jpg',
 'mcr_lter6_fringingreef_pole2-3_qu3_20080502.jpg',
 'mcr_lter5_out10m_pole2-3_qu1_20080408.jpg',
 'mcr_lter4_fringingreef_pole3-4_qu8_20080504.jpg']

**Enter the image names you want predictions for here (as a list)**

In [7]:
# By default, request predictions for every image in the source
desired_images = SOURCE_IMAGES['image_name'].tolist()

# We will get the information needed from the source images dataframe
IMAGES = SOURCE_IMAGES[SOURCE_IMAGES['image_name'].isin(desired_images)]

In [8]:
IMAGES

Unnamed: 0,image_name,image_page,image_url
0,mcr_lter1_fringingreef_pole1-2_qu1_20080415.jpg,https://coralnet.ucsd.edu/image/2856289/view/,https://coralnet-production.s3.amazonaws.com:4...
1,mcr_lter1_fringingreef_pole1-2_qu2_20080415.jpg,https://coralnet.ucsd.edu/image/2856284/view/,https://coralnet-production.s3.amazonaws.com:4...
2,mcr_lter1_fringingreef_pole1-2_qu3_20080415.jpg,https://coralnet.ucsd.edu/image/2856279/view/,https://coralnet-production.s3.amazonaws.com:4...
3,mcr_lter1_fringingreef_pole1-2_qu4_20080415.jpg,https://coralnet.ucsd.edu/image/2856274/view/,https://coralnet-production.s3.amazonaws.com:4...
4,mcr_lter1_fringingreef_pole1-2_qu5_20080415.jpg,https://coralnet.ucsd.edu/image/2856272/view/,https://coralnet-production.s3.amazonaws.com:4...
...,...,...,...
666,mcr_lter6_out17m_pole5-6_qu4_20080406.jpg,https://coralnet.ucsd.edu/image/2852711/view/,https://coralnet-production.s3.amazonaws.com:4...
667,mcr_lter6_out17m_pole5-6_qu5_20080406.jpg,https://coralnet.ucsd.edu/image/2852705/view/,https://coralnet-production.s3.amazonaws.com:4...
668,mcr_lter6_out17m_pole5-6_qu6_20080406.jpg,https://coralnet.ucsd.edu/image/2852700/view/,https://coralnet-production.s3.amazonaws.com:4...
669,mcr_lter6_out17m_pole5-6_qu7_20080406.jpg,https://coralnet.ucsd.edu/image/2852695/view/,https://coralnet-production.s3.amazonaws.com:4...


For each of these images, we need to specify the points on the image we
want the model to make predictions for. Here we use a function that will
sample 200 points from each image. For demonstration purposes, we save
these points as a CSV file in the `SOURCE_POINTS` folder.

If you have your own CSV file(s), simply add the file paths to the
`POINT_PATHS` list (see next cell).
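
The implementation of `sample_points_for_url` is not shown in this notebook. As an illustration of what stratified sampling of points can look like, the sketch below splits the image into a grid of cells, draws one random point per cell, and keeps a random subset of the requested size. The function name `stratified_points`, its parameters, and the grid strategy are assumptions, not CoralNet's method.

```python
import math
import random


def stratified_points(height, width, num_samples, seed=None):
    """Draw num_samples points, at most one per cell of a square grid."""
    if height <= 0 or width <= 0:
        raise ValueError("image dimensions must be positive")
    rng = random.Random(seed)
    # Smallest grid with at least num_samples cells
    grid = math.ceil(math.sqrt(num_samples))
    points = []
    for i in range(grid):
        row_lo = i * height // grid
        row_hi = max(row_lo, (i + 1) * height // grid - 1)
        for j in range(grid):
            col_lo = j * width // grid
            col_hi = max(col_lo, (j + 1) * width // grid - 1)
            # randint is inclusive on both ends
            points.append({"row": rng.randint(row_lo, row_hi),
                           "column": rng.randint(col_lo, col_hi)})
    # The grid may hold more cells than needed; keep a random subset
    return rng.sample(points, num_samples)


example_points = stratified_points(1500, 2000, 200, seed=0)
```

Compared with purely uniform sampling, one point per cell spreads the points more evenly over the image.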

In [9]:
# Creating points for each of the desired images
for image in desired_images:
    # We use the SOURCE_IMAGES dataframe to get the URL of the image
    image_url = IMAGES[IMAGES['image_name'] == image]['image_url'].values[0]
    # Then we sample points from the image
    x, y, samples = sample_points_for_url(image_url, num_samples=200, method='stratified')
    # If the url hasn't expired
    if samples is not None:
        # Create a points dataframe for the image
        points_df = pd.DataFrame(samples)
        points_df['image_name'] = image
        # Finally we save the points to a csv file in the SOURCE_POINTS folder
        points_df.to_csv(SOURCE_POINTS + image + ".csv", index=False)

Here we get all the points for each of the desired images. If you already
have one or multiple CSV files with the points, you can simply add the
file paths to the list below.

**Enter the file paths to the CSV files here (as a list)**

In [10]:
# Get all the points for all the images
POINT_PATHS = glob.glob(SOURCE_POINTS + "*.csv")

# This dataframe will contain all the points for all the images
# The columns are `image_name`, `row`, and `column`.
POINTS = pd.DataFrame()
# We then concatenate all the points into a single dataframe
for path in POINT_PATHS:
    points = pd.read_csv(path)
    points['image_name'] = os.path.basename(path)
    POINTS = pd.concat([POINTS, points])

Here we can see that all CSV files for all images have been concatenated
into a single dataframe. The `image_name` column represents the image that
the points are associated with.

In [11]:
POINTS.sample(5)

Unnamed: 0,row,column,image_name
136,1407,1292,mcr_lter6_out10m_pole1-2_qu8_20080406.jpg.csv
92,1198,900,mcr_lter3_out17m_pole4-5_qu7_20080404.jpg.csv
117,719,1163,mcr_lter6_out17m_pole4-5_qu3_20080406.jpg.csv
4,663,96,mcr_lter5_out17m_pole5-6_qu2_20080408.jpg.csv
128,415,1294,mcr_lter1_out17m_pole2-3_qu7_20080411.jpg.csv
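
Note that the `image_name` values above end in `.jpg.csv`, because the concatenation cell overwrites the column with the name of each CSV file. The prediction loop still finds the points because it matches image names by substring. If exact image names are preferred, the trailing `.csv` can be dropped from the file name; a minimal sketch (the helper name is hypothetical):

```python
from pathlib import Path


def image_name_from_points_path(path):
    """Drop only the final '.csv' suffix, so 'x.jpg.csv' becomes 'x.jpg'."""
    return Path(path).stem


example_name = image_name_from_points_path(
    "points/mcr_lter6_out10m_pole1-2_qu8_20080406.jpg.csv"
)
```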


#### Make predictions

This is the main part of the script. We loop through each image, get the
points for that image, and then make predictions for those points. We then
save the predictions to a CSV file in the `SOURCE_PREDICTIONS` folder.

There are several loops in this section. The outer `while` loop runs until
all images have been processed. Inside it, a `for` loop prepares the data for
each image that is not yet queued, active, or completed: a JSON object that
contains the image URL and its points. A second `for` loop groups these
objects into jobs (up to `data_per_payload` images each) and appends them to
the queue of jobs waiting to be submitted. The first inner `while` loop
submits queued jobs for as long as there are open positions (only 5 jobs may
be active at a time). The second inner `while` loop checks the status of each
active job. If a job is complete, its predictions are saved to a CSV file and
the job is removed from the active list. If jobs are still running, the loop
waits 75 seconds before checking again. Once nothing is queued, active, or
expired, the outer `while` loop ends and the script finishes.
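
Stripped of the CoralNet specifics, the queued, active, and completed bookkeeping described above can be sketched as follows. This is a simplified model, not the notebook's code: `run_jobs`, `submit`, `is_done`, and `wait` are hypothetical names, a fake backend replaces the HTTP requests, and expired images are left out.

```python
from collections import deque

MAX_ACTIVE = 5  # limit on simultaneously active jobs, as described in the text


def chunk(items, size):
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_jobs(image_names, batch_size, submit, is_done, wait):
    queued = deque(chunk(image_names, batch_size))
    active = []      # (job handle, image names) pairs
    completed = []   # image names whose job has finished
    while queued or active:
        # Fill every open position from the queue
        while queued and len(active) < MAX_ACTIVE:
            names = queued.popleft()
            active.append((submit(names), names))
        wait()  # the notebook sleeps for `patience` seconds here
        still_running = []
        for handle, names in active:
            if is_done(handle):
                completed.extend(names)
            else:
                still_running.append((handle, names))
        active = still_running
    return completed


# Fake backend: every job reports completion on its second status check
polls = {}

def fake_submit(names):
    handle = len(polls)
    polls[handle] = 0
    return handle

def fake_is_done(handle):
    polls[handle] += 1
    return polls[handle] >= 2

done = run_jobs([f"img_{i}.jpg" for i in range(1200)], 100,
                fake_submit, fake_is_done, wait=lambda: None)
```

Rebuilding the active list on each pass avoids removing items from a list while iterating over it.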

Because the images are hosted on AWS, there is a chance that the URL will
expire before the model can make a prediction. If this happens, the script
will catch the error, and add the image to a list of expired images. Once
all the images have been processed, the script will loop through the expired
images, and update the URLs. It will then re-run the predictions for those
images.
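
The notebook's `is_expired` function is not shown here. One way such a check could work is sketched below, under the assumption that the signed URL carries an `Expires` query parameter holding a Unix timestamp, as S3 query-string-authenticated URLs commonly do. The function name `url_is_expired`, the `margin_seconds` parameter, and the example URL are hypothetical, and URLs signed with a different scheme would need different parsing.

```python
import time
from urllib.parse import parse_qs, urlparse


def url_is_expired(url, now=None, margin_seconds=0):
    """Return True if the URL's `Expires` timestamp is in the past.

    Assumption: a URL without an `Expires` parameter cannot be checked
    this way and is treated as not expired.
    """
    now = time.time() if now is None else now
    query = parse_qs(urlparse(url).query)
    expires = query.get("Expires")
    if not expires:
        return False
    # margin_seconds treats URLs that are about to expire as expired
    return float(expires[0]) <= now + margin_seconds


# Hypothetical signed URL
example_url = "https://example-bucket.s3.amazonaws.com/img.jpg?Signature=abc&Expires=1000"
```

A positive `margin_seconds` is useful when a job may sit in the queue for a while before the server fetches the image.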

In [12]:
# Jobs that are currently queued
queued_jobs = []
queued_imgs = []
# Jobs that are currently active
active_jobs = []
active_imgs = []
# Jobs that are completed
completed_jobs = []
completed_imgs = []
# A list that contains just the images that need updated urls
expired_imgs = []
# Flag to indicate if all images have been passed to model
finished = False
# The amount of time to wait before checking the status of a job
patience = 75
# The number of images to include in each job
data_per_payload = 100

# This will continue looping until all images have been processed
while not finished:

    # A list for images and data that have been sampled this round
    payload_data = []
    payload_imgs = []

    # Loop through each requested image, get its points, and add it to a queue
    for index, row in IMAGES.iterrows():

        # Get the current image name and url
        name = row['image_name']
        url = row['image_url']

        # If this image has already been completed, skip it.
        if name in completed_imgs:
            # print(f"Image {name} already completed; skipping")
            continue # Skip to the next image within the current for loop

        # If this image is already in active, skip it.
        elif any(name in n for n in active_imgs):
            # print(f"Image {name} already active; skipping")
            continue # Skip to the next image within the current for loop

        elif any(name in n for n in queued_imgs):
            # print(f"Image {name} already queued; skipping")
            continue # Skip to the next image within the current for loop

        # The image url has not expired, so we can queue the image
        elif not is_expired(url):
            print(f"NOTE: Image {name} not in queue; sampling")
            # Match the name literally (not as a regular expression)
            points = POINTS[POINTS['image_name'].str.contains(name, regex=False)]
            points = points[['row', 'column']].to_dict(orient="records")

            # Add the data to the list for payloads
            payload_imgs.append(name)
            payload_data.append(
                {
                    "type": "image",
                    "attributes": {
                        "name": name,
                        "url": url,
                        "points": points
                    }
                })

        else:
            # The image url expired, so we need to update it later.
            print(f"WARNING: {name} expired; adding to expired list")
            expired_imgs.append(name)
            continue # Skip to the next image within the current for loop

    # Here we initialize the payload, which is a JSON object that
    # contains the image URLs and their points; payloads will contain
    # batches of data (N = data_per_payload).
    for start in range(0, len(payload_imgs), data_per_payload):
        # Get the image names and data for the payload
        image_names = payload_imgs[start : start + data_per_payload]
        payload = {'data': payload_data[start : start + data_per_payload]}
        # Use the payload to construct the job
        job = {
            "headers": HEADERS,
            "model_url": MODEL_URL,
            "image_names": image_names,
            "data": json.dumps(payload, indent=4),
        }
        # Add the job to the queue
        queued_jobs.append(job)
        queued_imgs.append(image_names)

    # Print the status of the jobs
    print_job_status(queued_jobs, active_jobs, completed_jobs, expired_imgs)

    # Start uploading the queued jobs to CoralNet if there are
    # less than 5 active jobs, and there are more in the queue.
    # If there are no queued jobs, this won't need to be entered.
    while len(active_jobs) < 5 and len(queued_jobs) > 0:

        # Loop through all the queued jobs
        for job, names in list(zip(queued_jobs, queued_imgs)):

            # Stop when active reaches 5. This exits the for loop; the
            # enclosing while loop then ends because its condition is false.
            if len(active_jobs) >= 5:
                print("NOTE: Maximum number of active jobs reached; checking status")
                break

            # Upload the image and the sampled points to CoralNet
            print(f"NOTE: Attempting to upload {len(names)} images")

            # Send the request to the model's deploy URL; the response
            # indicates whether the job was received correctly.
            response = requests.post(url=job["model_url"],
                                     data=job["data"],
                                     headers=job["headers"])
            if response.ok:
                # If it was received
                print(f"NOTE: Successfully uploaded: {len(names)} images")

                # Add to active jobs
                active_jobs.append(response)
                active_imgs.append(names)

                # Remove from queued jobs
                queued_jobs.remove(job)
                queued_imgs.remove(names)

            else:
                # There was an error uploading to CoralNet; get the message
                message = json.loads(response.text)['errors'][0]['detail']

                # Print the message
                print(f"CoralNet: {message}")

                if "5 jobs active" in message:
                    print(f"NOTE: Will attempt again at {in_N_seconds(patience)}")
                    time.sleep(patience)

                else:
                    # Assumed that the images have expired
                    print(f"ERROR: Failed to upload: {len(names)} images")

                    # Add to expired images
                    expired_imgs.extend(names)

                    # Remove from queue
                    queued_jobs.remove(job)
                    queued_imgs.remove(names)

        # If all images have expired, break from the loop
        if IMAGES['image_name'].isin(expired_imgs).all():
            print("NOTE: All images have expired")
            break

    # Check the status of the active jobs, break when another can be added
    while len(active_jobs) <= 5 and len(active_jobs) != 0:

        # Check the status of the active jobs
        print_job_status(queued_jobs, active_jobs, completed_jobs, expired_imgs)

        # Sleep before checking status again
        print(f"\nNOTE: Checking status again at {in_N_seconds(patience)}")
        time.sleep(patience)

        # Loop through the active jobs
        for i, (job, names) in enumerate(list(zip(active_jobs, active_imgs))):

            # Check the status of the current job
            current_status, message, wait = check_job_status(job, CORALNET_TOKEN)

            # Print the message
            print(f"NOTE: {message}")

            # Current job has finished, output the results, remove from active
            if message == "Completed Job":

                # Convert to csv, and save locally, check expired
                predictions, expired = convert_to_csv(current_status,
                                                      names,
                                                      SOURCE_PREDICTIONS)

                # Deal with images after the job has been completed
                for name in names:
                    # If the image was in expired, add to expired
                    if name in expired:
                        expired_imgs.append(name)
                    # Else, add to completed_imgs
                    else:
                        completed_imgs.append(name)

                # Add to completed jobs list
                print(f"NOTE: Adding {len(names)} images to completed")
                completed_jobs.append(current_status)

                # Remove from active jobs, images list
                print(f"NOTE: Removing {len(names)} images from active")
                active_imgs.remove(names)
                active_jobs.remove(job)

            # Wait for the specified time before checking the status again
            time.sleep(wait)

        # After checking the current status, break if another can be added
        # Else wait and check the status of the active jobs again.
        if len(active_jobs) < 5 and len(queued_jobs) > 0:
            print(f"NOTE: Active jobs {len(active_jobs)}; adding another.")
            break

    # Check to see everything has been completed, breaking the loop
    if not queued_jobs and not active_jobs and not expired_imgs:
        print("NOTE: All images have been processed; exiting loop.")
        finished = True

    # If there are no queued jobs, and no active jobs, but there are images in
    # expired, get just the AWS URL for the expired images and update dataframe.
    if not queued_jobs and not active_jobs and expired_imgs:
        print(f"NOTE: Updating {len(expired_imgs)} expired images' URL")
        # Get the subset of images dataframe containing only the expired images
        IMAGES = IMAGES[IMAGES['image_name'].isin(expired_imgs)].copy()
        old_urls = IMAGES['image_page'].tolist()
        new_urls = get_image_urls(old_urls, USERNAME, PASSWORD)
        IMAGES['image_url'] = new_urls
        expired_imgs = []

NOTE: Image mcr_lter1_fringingreef_pole1-2_qu1_20080415.jpg not in queue; sampling
NOTE: Image mcr_lter1_fringingreef_pole1-2_qu2_20080415.jpg not in queue; sampling
NOTE: Image mcr_lter1_fringingreef_pole1-2_qu3_20080415.jpg not in queue; sampling
NOTE: Image mcr_lter1_fringingreef_pole1-2_qu4_20080415.jpg not in queue; sampling
NOTE: Image mcr_lter1_fringingreef_pole1-2_qu5_20080415.jpg not in queue; sampling
NOTE: Image mcr_lter1_fringingreef_pole1-2_qu6_20080415.jpg not in queue; sampling
NOTE: Image mcr_lter1_fringingreef_pole1-2_qu7_20080415.jpg not in queue; sampling
NOTE: Image mcr_lter1_fringingreef_pole1-2_qu8_20080415.jpg not in queue; sampling
NOTE: Image mcr_lter1_fringingreef_pole2-3_qu1_20080415.jpg not in queue; sampling
NOTE: Image mcr_lter1_fringingreef_pole2-3_qu2_20080415.jpg not in queue; sampling
NOTE: Image mcr_lter1_fringingreef_pole2-3_qu3_20080415.jpg not in queue; sampling
NOTE: Image mcr_lter1_fringingreef_pole2-3_qu4_20080415.jpg not in queue; sampling
NOTE