# Webscraper Overview

Hello. First, thank you for running this program. It will go through your CEO plots, take a photo of each individual dot, and sort these images by category into a folder that the program creates. The program does most of the work; all you need to do is direct it to the right plots. You do this by answering prompts: the program asks you to enter some information and, after you enter it, sends that information to the site behind the scenes. The pieces of information you will need to enter are:

1. Your CEO account information (username and password)
2. The name of the institution (GLOBE Observer, in our case)
3. The project you want to scrape (a list will be printed, and you'll pick the project from the list)
4. The place in your Drive where the folder of images will be created
5. Whether you would like to see your analyzed plots (plots you've marked) or others' plots (which will be unmarked)
6. Which plot you would like to start at (enter the ID of the plot you'd like to start at)
7. How many plots you'd like to scrape

Some of the cells in the notebook will take a while to finish running. This is okay. There might also be some visual glitches, but those don't affect anything either. If the cells **explicitly** throw an error (ValueError, TypeError, etc.), then something has gone wrong (and you can either email me, ananthmadan03@gmail.com, or DM me).

Just press Runtime>Run All (or Ctrl-F9) and let the program run. After all of the prompts have been answered (after you've entered how many plots you want to scrape), you can let the program run in the background while you work on something else. If you notice the program has gone idle (the CO icon should turn red), simply save and reload the page.

We hope that you will use this to scrape both your AOI and centerplot data from CEO.

Thank you.

P.S. The images generated by this process amount to approximately 32 MB of data.

In [None]:
!pip install selenium

Collecting selenium
  Downloading https://files.pythonhosted.org/packages/80/d6/4294f0b4bce4de0abf13e17190289f9d0613b0a44e5dd6a7f5ca98459853/selenium-3.141.0-py2.py3-none-any.whl (904kB)
     |████████████████████████████████| 911kB 2.8MB/s 
Installing collected packages: selenium
Successfully installed selenium-3.141.0


In [None]:
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-440
Use 'apt autoremove' to remove it.
The following additional packages will be installed:
  chromium-browser chromium-browser-l10n chromium-codecs-ffmpeg-extra
Suggested packages:
  webaccounts-chromium-extension unity-chromium-extension adobe-flashplugin
The following NEW packages will be installed:
  chromium-browser chromium-browser-l10n chromium-chromedriver
  chromium-codecs-ffmpeg-extra
0 upgraded, 4 newly installed, 0 to remove and 35 not upgraded.
Need to get 75.5 MB of archives.
After this operation, 256 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 chromium-codecs-ffmpeg-extra amd64 83.0.4103.61-0ubuntu0.18.04.1 [1,119 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 chromium-browser amd64 83.0.410

In [None]:
import sys
sys.path.insert(0, '/usr/lib/chromium-browser/chromedriver')

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, UnexpectedAlertPresentException, NoAlertPresentException
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

import re

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


In [None]:
options = webdriver.ChromeOptions()
options.add_argument('--headless') # Disable GUIs for colab compatibility
options.add_argument('--no-sandbox') # Disable sandbox for third-party software to run
options.add_argument('--disable-dev-shm-usage') # Use /tmp instead of /dev/shm for shared memory (/dev/shm is small in containers)

driver = webdriver.Chrome('chromedriver', options=options)
driver.get('https://collect.earth/login')
driver.set_window_size(1920,1200) #driver.maximize_window() was here before, it cut off part of the square and that's not very nice of them

# Login

In [None]:
class ChangeToValidURL:
    
    def __init__(self, valid=None):
        self.valid_urls = list(valid) if valid is not None else [] # Avoid a shared mutable default argument
    
    def __call__(self, driver):
        try:
            return driver.current_url in self.valid_urls
        except UnexpectedAlertPresentException:
            driver.switch_to.alert.accept() # Sometimes raises a NoAlertPresentException, I don't know why
            raise TimeoutException

Enter your email and password when prompted. This signs you in to your Collect Earth account so that the program can access your plots.

In [None]:
from IPython.display import clear_output

# Iterator over the retry prompts (one per failed attempt)
retry_prompts = iter([

    'you have two more chances',
    'one more chance',
    'Max amount of tries met.'
])

email_field = driver.find_element_by_xpath('//*[(@id = "email")]')
password_field = driver.find_element_by_xpath('//*[(@id = "password")]')

# Attempt login (3 times)
while True:

    # Prompt user for information
    email = input('Enter email: ')
    password = input('Enter password: ')

    email_field.send_keys(email) # Enter email
    password_field.send_keys(password) # Enter password

    clear_output(False) # Clear user information

    driver.find_element_by_xpath(
        '//*[contains(concat( " ", @class, " " ), concat( " ", "align-items-center", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "bg-lightgreen", " " ))]'
    ).click() # Click login button

    # If the page changes (login has succeeded)
    try:
        
        WebDriverWait(driver, 2).until(
            ChangeToValidURL(['https://collect.earth/home'])
        )

        print('Login succeeded')
        break
        
    # Otherwise, prompt user to retry login
    except (TimeoutException, NoAlertPresentException):

        retry_prompt = next(retry_prompts)

        # If the user has attempted login unsuccessfully 3 times
        if retry_prompt == 'Max amount of tries met.':

            print(retry_prompt)
            driver.close() # Close driver
            sys.exit(0)
        
        else:

            print('Incorrect login information.')
            print(f'Please, try again ({retry_prompt})')

            # Clear email and password fields
            email_field.send_keys(Keys.CONTROL + 'a')
            email_field.send_keys(Keys.DELETE)
            password_field.send_keys(Keys.CONTROL + 'a')
            password_field.send_keys(Keys.DELETE)

Login succeeded
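
The retry logic in the login cell can also be written as a small, browser-independent function. The following is only a sketch: `login_with_retries`, `get_credentials`, and `try_login` are hypothetical names that do not appear in the notebook, and the callables at the bottom are stand-ins so that the example runs without Selenium.

```python
from typing import Callable, Tuple


def login_with_retries(
    get_credentials: Callable[[], Tuple[str, str]],
    try_login: Callable[[str, str], bool],
    max_tries: int = 3,
) -> bool:
    """Return True on the first successful login, False once max_tries is used up."""
    for attempt in range(1, max_tries + 1):
        email, password = get_credentials()
        if try_login(email, password):
            return True
        remaining = max_tries - attempt
        if remaining:
            print(f'Incorrect login information ({remaining} tries left).')
    print('Max amount of tries met.')
    return False


# Stand-in callables (assumptions) so the sketch runs without a browser
answers = iter([('a@example.com', 'wrong'), ('a@example.com', 'right')])
ok = login_with_retries(lambda: next(answers), lambda email, pw: pw == 'right')
print(ok)
```

In the notebook, `try_login` would correspond to filling in the two fields, clicking the login button, and waiting for the URL to change.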


# Machinated Human-like Search

Option to go through the site like a human would (clicking buttons, filling out forms, etc.) to find the desired project

This cell filters the institutions: whatever you type is entered into the Collect Earth institution filter. Then select an institution from the filtered list by entering its number.

In [None]:
filter_elem = driver.find_element_by_xpath('//*[(@id = "filterInstitution")]')
while True:

    repeat = False
    filter_str = input('Enter institution name/filter (string): ')
    filter_elem.send_keys(filter_str)

    inst_dict = dict(enumerate(driver.find_elements_by_xpath("//p[contains(@class,'tree_label text-white')]"), 1))

    while True:

        clear_output(True)
        for i, inst in inst_dict.items():
            print('{0}. {1}'.format(i, inst.text.strip('\nⓘ'))) # Terrible to put this in the loop, but whatever

        choice = input('Select an institution (enter the number).\nIf you want to re-filter, enter nothing: ')

        if choice == '':

            filter_elem.send_keys(Keys.CONTROL + 'a')
            repeat = True
            break

        try:

            assert choice.isdigit() # User input must be a number
            choice = int(choice)
            assert choice >= 1 and choice <= len(inst_dict) # User input must be within range

            break

        except AssertionError:

            print(f'Invalid choice (must be a digit from 1 to {len(inst_dict)})')

    if not repeat:

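        # Note: an XPath that starts with '//' searches the whole page rather than only inside
        # the chosen element, so this clicks the first matching button on the page
        inst_dict[choice].find_element_by_xpath("//a[contains(@class,'institution_info btn')]").click() # Click the institution button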
        break

1. GLOBE Observer
2. The GLOBE Program
Select an institution (enter the number).
If you want to re-filter, enter nothing: 1
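
The numbered-menu validation used in these cells relies on `assert`, which Python skips when run with the `-O` flag. A minimal sketch of the same check with explicit tests is shown here; `parse_choice` is a hypothetical helper, not part of the notebook.

```python
from typing import Optional


def parse_choice(raw: str, count: int) -> Optional[int]:
    """Return the 1-based menu choice, or None if the input is not valid."""
    try:
        choice = int(raw.strip())
    except ValueError:
        return None  # Not a whole number
    return choice if 1 <= choice <= count else None  # Must be within the menu range


for raw in ('13', '0', 'abc', ''):
    print(repr(raw), '->', parse_choice(raw, 18))
```

A caller would keep prompting until the function returns something other than `None`.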


Here, select the project from the project list by again entering the number corresponding to the project.

In [None]:
# Get a list of all of the projects (found using xpath)
project_list = WebDriverWait(driver, 2).until(
    
    EC.presence_of_all_elements_located((By.XPATH, '//*[contains(concat( " ", @class, " " ), concat( " ", "d-flex", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "text-truncate", " " ))]'))
)
project_dict = dict(enumerate(project_list, 1)) # Dictionary holding projects and a numerical ID

# Print project and ID
for i, project in project_dict.items():
    name = project.get_attribute('innerHTML') # Only contains project name
    print(f'{i}. {name}')

while True:
    
    # Prompt user for project select
    choice = input(f'Which project would you like to scrape (1 to {len(project_list)})? ')

    try:

        assert choice.isdigit() # User input must be a number
        choice = int(choice)
        assert choice >= 1 and choice <= len(project_list) # User input must be within range
    
    except AssertionError:

        print(f'Invalid choice (must be a digit from 1 to {len(project_list)})')
        continue
    
    # Click on the chosen project
    chosen = project_dict[choice]
    proj_name = chosen.get_attribute('innerHTML')
    chosen.click()
    break

1. GO LandChallenge part 1 (Sept13-Sept29)
2. GO LandChallenge part 1 BARREN OR NOT(Sept13-Sept29)
3. MHM-Peru-GMM-EarthDay
4. SEES2020-AOI (1-56)
5. SEES2020-AOI_centerPlot (1-56)
6. SEES2020-AOI (57-76)
7. SEES2020-AOI_centerPlot (57-76)
8. SEES2020-AOI (1-102)
9. SEES2020-AOI_centerPlot (1-102)
10. SEES2020-AOI (1-108)
11. SEES2020-AOI_centerPlot (1-108)
12. SEES2020-AOI_1to56_36GridOfInterest
13. SEES2020-AOI_57to108_36GridOfInterest
14. SEES2020-AOI_centerPlot (1-118)
15. SEES2020-AOI (1-118)
16. SEES2020-AOI_109to118_36GridOfInterest
17. SEES2020-AOI_centerPlot (1-118)_copy
18. SEES2020-AOI_109to118_36GridOfInterest_copy
Which project would you like to scrape (1 to 18)? 13


# Plot Analysis



In [None]:
import os

In [None]:
while True:

    folder_path = input('Where would you like to save the plot images (must be a valid path name).\nIf you want to save at root, enter nothing: ')

    if folder_path != '':
        folder_path = f'/content/gdrive/{folder_path}/CEO_Plots/{proj_name}'
    else:
        folder_path = f'/content/gdrive/My Drive/CEO_Plots/{proj_name}'

    clear_output(True)

    try:

        if not os.path.isdir(folder_path):
            os.makedirs(folder_path)
            print(f'Made directory {folder_path}')
        else:
            print(f'Directory {folder_path} already exists.\nDefaulting to existing location.')

        break

    except OSError as e:
        print(f'Invalid path name: {e.args}')

Made directory /content/gdrive/My Drive/Saravana/CEO_Plots/SEES2020-AOI_57to108_36GridOfInterest
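
As a side note, the "create the folder unless it already exists" step can be expressed with `os.makedirs(..., exist_ok=True)`. The sketch below uses a temporary directory in place of the mounted Drive, and `ensure_plot_folder` and `'demo-project'` are hypothetical names chosen for illustration.

```python
import os
import tempfile


def ensure_plot_folder(root: str, proj_name: str) -> str:
    """Create <root>/CEO_Plots/<proj_name> if needed and return its path."""
    folder = os.path.join(root, 'CEO_Plots', proj_name)
    os.makedirs(folder, exist_ok=True)  # No error if the folder is already there
    return folder


with tempfile.TemporaryDirectory() as tmp_root:
    first = ensure_plot_folder(tmp_root, 'demo-project')
    second = ensure_plot_folder(tmp_root, 'demo-project')  # Second call changes nothing
    created = os.path.isdir(first)

print(created, first == second)
```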


When prompted, enter yes to scrape the plots you have already analyzed, or no to scrape unanalyzed plots.

In [None]:
prompt = 'Would you like to see your analyzed plots (yes or no)? '
valid_yes = 'yes'
valid_no = 'no'
while True:
    
    # Prompt user
    review = input(prompt).lower()
    
    if review != valid_yes and review != valid_no:

        print(f'Invalid answer (must be: {valid_yes} or {valid_no})')
        clear_output(True) # For readability
        continue
    
    if review == '2':
        
        driver.close()
        sys.exit(0)

    # Accept different versions of 'yes'
    # The worst try-except block I've ever seen
    try:
        
        if review == valid_yes:
            
            if valid_yes == '1':
                
                driver.find_element_by_xpath(
        
                    '//*[contains(concat( " ", @class, " " ), concat( " ", "mt-2", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "btn-outline-darkgray", " " ))]'
                
                ).click()
            
            # Click on the 'review analyzed plots' form
            WebDriverWait(driver, 2).until(
            
                EC.element_to_be_clickable((By.CLASS_NAME, 'form-check'))
            
            ).click()

        WebDriverWait(driver, 5).until(
        
            EC.presence_of_element_located((By.ID, 'go-to-first-plot'))
        
        ).click()
        
        # Find the plot top select button (for plot identification)
        # if it becomes clickable within 20 seconds
        # Otherwise, a TimeoutException is raised
        # and handled below by re-prompting the user
        top_slct_btn = WebDriverWait(driver, 20).until(
            
            EC.element_to_be_clickable((By.XPATH, '//*[(@id = "top-select")]'))
        
        )
        str_id = top_slct_btn.get_attribute('innerHTML') # Contains only the plot ID number
        
        break
    
    except TimeoutException as timeout:
        
        print(f'Element could not be located: {timeout}')
        continue
        
    except UnexpectedAlertPresentException as alert:
        
        print(re.search('(?<={).*(?=})', str(alert)).group()) # Alert message
        
        # New user prompt
        prompt = '''Would you like to:\n
                 1. Continue and instead view unanalyzed plots?\n
                 2. Exit program?\n'''
                 
        valid_yes = '1'
        valid_no = '2'

Would you like to see your analyzed plots (yes or no)? yes


In [None]:
# Re-click plot navigation button to re-open plot navigation submenu
driver.find_element_by_xpath(
    "(//h3[contains(@class,'ml-3 btn')])[1]"
).click()

In [None]:
class ElementChange:

    def __init__(self, driver, locator, attr='innerHTML'):
        self.locator = locator
        self.attr = attr
        self.start_attr = EC._find_element(driver, locator).get_attribute(attr)

    def __call__(self, driver):

        try:
            return self.start_attr != EC._find_element(driver, self.locator).get_attribute(self.attr)
        except UnexpectedAlertPresentException:
            alert = driver.switch_to.alert
            print(alert.text)
            alert.accept()
            raise TimeoutException
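
The idea behind `ElementChange` is a callable that remembers a starting value and reports when the value differs. The sketch below imitates that pattern without a browser. `ValueChanged`, `wait_until`, and `read_label` are hypothetical names, and `wait_until` is only a simplified stand-in for `WebDriverWait.until` (which, unlike here, passes the driver to the condition).

```python
import time


class ValueChanged:
    """Callable condition: true once read_value() differs from its starting value."""

    def __init__(self, read_value):
        self.read_value = read_value
        self.start = read_value()  # Remember the value seen at construction time

    def __call__(self):
        return self.read_value() != self.start


def wait_until(condition, timeout=1.0, poll=0.01):
    """Poll condition() until it is truthy; raise TimeoutError after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met in time')
        time.sleep(poll)


# Stand-in for a page label whose text changes on the third read
labels = iter(['Plot 7801', 'Plot 7801', 'Plot 7802'])
current = {'text': ''}


def read_label():
    current['text'] = next(labels, current['text'])
    return current['text']


changed = wait_until(ValueChanged(read_label))
print(changed, current['text'])
```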

Start at your first plot. If the plot ID shown in the prompt matches the plot you want to start at, press Enter to begin. Otherwise, enter the ID of your first plot.

In [None]:
while True:

    id_elem = driver.find_element_by_id('plotId')
    plt_id = re.search(r'\d+', driver.find_element_by_xpath("(//h3[@class='ml-2'])[1]").get_attribute('innerHTML')).group()

    plt_num = input(f'Enter the plot ID of the plot you\'d like to go to (integer).\n \
                    If you want to start at the current plot, enter nothing.\n \
                    (Current plot ID is {plt_id}): ')
    
    if plt_num == '':
        break

    try:
        id_elem.send_keys((Keys.CONTROL + 'a'))
        id_elem.send_keys(plt_num)
        driver.find_element_by_xpath("(//button[@class='btn btn-outline-lightgreen'])[2]").click()

        WebDriverWait(driver, 10).until(
            ElementChange(driver, (By.XPATH, "(//h3[@class='ml-2'])[1]"))
        )

        plt_id = re.search(r'\d+', driver.find_element_by_xpath("(//h3[@class='ml-2'])[1]").get_attribute('innerHTML')).group()

        print(f'Current plot is now: {plt_id}')

        break
    
    except (TimeoutException, NoAlertPresentException):

        clear_output(True)

Enter the plot ID of the plot you'd like to go to (integer).
                     If you want to start at the current plot, enter nothing.
                     (Current plot ID is 7801): 


In [None]:
lc_elems = driver.find_elements_by_xpath(

    '//*[contains(concat( " ", @class, " " ), concat( " ", "pl-1", " " ))]'

)

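# Map each legend color, as an (R, G, B) tuple, to its land-cover name (used as the subdirectory name)
color_to_lc = dict(zip(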
    
    list(map(
        
        lambda x: tuple(map(int, x.split(', '))),
        [
            re.search(r'(?<=rgb\().*(?=\);)', lc_elem.get_attribute('innerHTML')).group()
            for lc_elem in lc_elems
        ]
    )),
    
    list(map(
        
        lambda x: x.replace('&gt;', '>').replace('/', '_or_').title(),
        [
            re.search('(?<=<span class="small">).*(?=</span)', lc_elem.get_attribute('innerHTML')).group()
            for lc_elem in lc_elems
        ]
    ))
))
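
To make the construction of `color_to_lc` easier to follow, here is a sketch that parses one legend entry at a time. The HTML strings in `sample_html` are assumptions about what a legend entry's `innerHTML` might look like; they were not copied from the site. `parse_legend_entry` is a hypothetical helper.

```python
import re

RGB_RE = re.compile(r'rgb\((\d+),\s*(\d+),\s*(\d+)\)')
NAME_RE = re.compile(r'<span class="small">(.*?)</span>')


def parse_legend_entry(inner_html):
    """Return ((r, g, b), folder_safe_name) for one legend entry."""
    rgb = tuple(int(v) for v in RGB_RE.search(inner_html).groups())
    name = NAME_RE.search(inner_html).group(1)
    # Same clean-up as the notebook: unescape '>', replace '/', title-case
    name = name.replace('&gt;', '>').replace('/', '_or_').title()
    return rgb, name


# Assumed sample markup, for illustration only
sample_html = [
    '<div style="background-color: rgb(0, 128, 0);"></div><span class="small">Bush/Scrub</span>',
    '<div style="background-color: rgb(0, 0, 255);"></div><span class="small">Water&gt;Treated Pool</span>',
]
color_to_name = dict(parse_legend_entry(html) for html in sample_html)
print(color_to_name)
```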


In [None]:
for land_cover in color_to_lc.values():

    try:
        os.mkdir(f'{folder_path}/{land_cover}')
        print(f'Created subdirectory {land_cover}')
    except OSError:
        print(f'Subdirectory {land_cover} already exists.\nDefaulting to existing subdirectory.')

Created subdirectory Trees_Canopycover
Created subdirectory Bush_Or_Scrub
Created subdirectory Grass
Created subdirectory Cultivated Vegetation
Created subdirectory Water>Treated Pool
Created subdirectory Water>Lake_Or_Ponded_Or_Container
Created subdirectory Water>Rivers_Or_Stream
Created subdirectory Water>Irrigation Ditch
Created subdirectory Shadow
Created subdirectory Unknown
Created subdirectory Bare Ground
Created subdirectory Building
Created subdirectory Impervious Surface (No Building)


In [None]:
import time

In [None]:
zoom_in = driver.find_element_by_xpath(
    '//*[contains(concat( " ", @class, " " ), concat( " ", "ol-zoom-in", " " ))]'
)
for _ in range(4):
    zoom_in.click()
    time.sleep(0.1)

In [None]:
import numpy as np
import cv2
import matplotlib.pyplot as plt
from google.colab.patches import cv2_imshow

In [None]:
bgr_set = set([tuple(reversed(x)) for x in color_to_lc])

In [None]:
def bytes_to_image(bytes_):
    array = np.frombuffer(bytes_, dtype=np.uint8)
    return cv2.imdecode(array, cv2.IMREAD_ANYCOLOR)

In [None]:
def get_plot_image(driver, pos, size, pad=(200, 50)):
    # This might need padding added to the plot size to shrink to overall plot size down
    # so the chance that cv classifies a random circle that isn't a circle goes down
    return bytes_to_image(driver.get_screenshot_as_png())[pos[1]+pad[1]:pos[1]+size[1]-pad[1], pos[0]+pad[0]:pos[0]+size[0]-pad[0]]

In [None]:
def move_sequence(driver, elem, dx_dy):
    # Clicks and drags the canvas by dx_dy; relies on the global half_size defined in a later cell
    actions = ActionChains(driver)

    actions.move_to_element_with_offset(elem, half_size[0]-10, half_size[1]-10)
    actions.click_and_hold()
    actions.move_by_offset(10, 10) # This doesn't do anything movement-wise; I assume it just updates the GUI
                                   # Whatever it does, it works. Thank you Larry

    actions.move_by_offset(*dx_dy) # We fancy out here with these positional args
    actions.release()
    actions.perform()

In [None]:
# Move the canvas distances that would normally be out of bounds
# by sub-dividing the distance into chunks that lie within bounds
# And then moving by the total offset at the end
def move_canvas(driver, elem, half_size, dx_dy, pad=15):

    signs = np.sign(dx_dy)
    dx_dy = np.abs(dx_dy)

    size = tuple(map(lambda x: x-pad, half_size))

    mod_dict = dict([(d//s, [d, s]) for s, d in zip(size, dx_dy)])

    i = max(mod_dict)
    d, s = mod_dict[i]

    if i != 0:

        if d == dx_dy[0]:

            step = np.multiply(signs, (s, dx_dy[1]//i))
            offset = np.multiply(signs, (d%s, dx_dy[1]%i)) # Remainders left over after i steps

        else:

            step = np.multiply(signs, (dx_dy[0]//i, s))
            offset = np.multiply(signs, (dx_dy[0]%i, d%s)) # Remainders left over after i steps

    else:

        step = (0, 0)
        offset = np.multiply(signs, dx_dy)

    for _ in range(i):
        move_sequence(driver, elem, step)
    move_sequence(driver, elem, offset)
    ActionChains(driver).move_by_offset(-offset[0], -offset[1]) # Note: never performed (no .perform() call), so this line currently has no effect
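
The chunking idea in `move_canvas` (split a long drag into steps that stay within bounds, then finish with the remainder) can be illustrated in one dimension. This is a simplified sketch: `split_drag` is a hypothetical helper, and the real function handles both axes at once.

```python
def split_drag(distance, max_step):
    """Split a signed 1-D drag into full steps of max_step plus a final remainder."""
    if max_step <= 0:
        raise ValueError('max_step must be positive')
    sign = -1 if distance < 0 else 1
    full_steps, remainder = divmod(abs(distance), max_step)
    moves = [sign * max_step] * full_steps
    if remainder:
        moves.append(sign * remainder)  # Leftover distance after the full steps
    return moves


print(split_drag(1070, 400))
print(split_drag(-50, 400))
```

Whatever the step size, the moves always add up to the requested distance, which is the property the canvas movement depends on.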

In [None]:
# Code to filter image by specific BGR
def filter_by_color(im, bgr, bound=1):
    
    lower = np.clip(np.array(bgr) - bound, 0, 255)
    upper = np.clip(np.array(bgr) + bound, 0, 255)
    
    mask = cv2.inRange(im, lower, upper)
    im_filtered = cv2.bitwise_and(im, im, mask=mask)
    
    return im_filtered

In [None]:
# Code to find circles in pictures
def cv_method(im, bgr_set):

    bgr_iter = iter(bgr_set)

    for i in range(len(bgr_set) + 1):

        if i == 0:
            im_base = im
        else:
            im_base = filter_by_color(im, np.array(next(bgr_iter)))

        im_gray = cv2.cvtColor(im_base, cv2.COLOR_BGR2GRAY) # cv2.imdecode yields BGR channel order

        # You are welcome to play with the numbers
        circles = cv2.HoughCircles(
            
            image = im_gray,
            method = cv2.HOUGH_GRADIENT,
            dp = 1,
            minDist = im_gray.shape[0],
            param1 = 50,
            param2 = 10,
            minRadius = 6,
            maxRadius = 8
        
        )
        
        if circles is not None:
            return [tuple(coord) for coord in np.uint16(np.around(circles))[0, :, :2]]

    return None

In [None]:
from scipy.stats import mode
from scipy.ndimage import center_of_mass

In [None]:
def kernel_find(im, color, bounds=(80, 190)):
        
        binary_mask = (im == color).all(axis=-1).astype(int)
        bin_sum = np.sum(binary_mask)
        if bin_sum > bounds[0] and bin_sum < bounds[1]:
            return np.uint16(np.around(center_of_mass(binary_mask)))
        else:
            return None, None
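
What `kernel_find` computes is the centroid of all pixels that exactly match one color, accepted only if the number of matching pixels looks like a single dot. A dependency-free sketch of that idea follows. `centroid_of_color` is a hypothetical name, the image is a plain list of rows of tuples rather than a NumPy array, and the result is in (row, column) order.

```python
def centroid_of_color(image, color, bounds=(80, 190)):
    """Return the (row, col) centroid of pixels equal to color, or None.

    None is returned when the number of matching pixels falls outside bounds.
    """
    row_sum = col_sum = count = 0
    for r, row in enumerate(image):
        for c, pixel in enumerate(row):
            if pixel == color:
                row_sum += r
                col_sum += c
                count += 1
    if not bounds[0] < count < bounds[1]:
        return None
    return round(row_sum / count), round(col_sum / count)


dot, background = (0, 128, 0), (0, 0, 0)
# 20x20 test image with a 9x11 block of the dot color (99 pixels)
image = [
    [dot if 5 <= r <= 13 and 3 <= c <= 13 else background for c in range(20)]
    for r in range(20)
]
print(centroid_of_color(image, dot))
```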

In [None]:
# Uses a kernel to find the dot as accurately as possible
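def kernel_method(plot, bgr_set, kernel_size=(25, 25), c_offset=(0, 0), threshold=(3, 3), min_size=(11, 11), num_zooms=3):

    # Copy the list-like arguments: they are modified in place below, and
    # mutable defaults would otherwise leak state from one call to the next
    kernel_size, c_offset, min_size = list(kernel_size), list(c_offset), list(min_size)

    im = plot 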
    y, x = im.shape[:2]

    kernel_size[0] = min(x, kernel_size[0])
    kernel_size[1] = min(y, kernel_size[1])

    if [x, y] == kernel_size:
        kernel_size = None

    y //= 2
    x //= 2

    x_lower, y_lower = 0, 0

    maj_color = None
    
    for i in range(num_zooms):
        
        if kernel_size is not None:
            
            y, x = y + c_offset[1], x + c_offset[0]
            
            lkby = np.floor(kernel_size[1]/2).astype(int)
            ukby = kernel_size[1] - lkby
            lkbx = np.floor(kernel_size[0]/2).astype(int)
            ukbx = kernel_size[0] - lkbx
            
            y_lower = y-lkby if y >= kernel_size[1] else 0
            x_lower = x-lkbx if x >= kernel_size[0] else 0
            
            im = plot[y_lower:y+ukby, x_lower:x+ukbx]

            print(f'k_size: {kernel_size}')
            print(f'k_size: {im.shape}')
            cv2_imshow(im)

        if maj_color is None:

            pixels = im.reshape(im.shape[0]*im.shape[1], 3)
            mask = [p for p in pixels if tuple(p) in bgr_set]
            if len(mask) == 0:
                return None
            else:
                maj_color = mode(mask)[0]

        x_coord, y_coord = kernel_find(im, maj_color)

        print(f'(x, y): {x_coord, y_coord}')

        if x_coord is None:
            return None
            
        coords = (x_coord + x_lower, y_coord + y_lower)
        dist = np.subtract((x, y), coords)

        print(f'dist: {dist}')
        
        if (np.abs(dist) <= threshold).all():

            return [coords]

        else:

            c_offset[0], c_offset[1] = dist[1], -dist[0]

            if kernel_size == min_size:
                return [coords] # Same list form as the other successful return

            if kernel_size is None:
                kernel_size = (x, y)

            else:
                if kernel_size[0] >= min_size[0]:
                    k_size_x = kernel_size[0]//2 if kernel_size[0]>=2*min_size[0] else min_size[0]
                    kernel_size[0] = k_size_x

                if kernel_size[1] >= min_size[1]:
                    k_size_y = kernel_size[1]//2 if kernel_size[1]>=2*min_size[1] else min_size[1]
                    kernel_size[1] = k_size_y

    return None

In [None]:
def find_circles(im, bgr_set):

    circles = kernel_method(im, bgr_set)
    if circles is not None:
        return circles
    else:
        return cv_method(im, bgr_set)

In [None]:
# Code to find yellow boundary lines in the screenshot
def find_lines(im, line_color=np.array([0, 255, 255])):
    
    im_filtered = filter_by_color(im, line_color)
    
    im_gray = cv2.cvtColor(im_filtered, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(im_gray, 25, 100)
    
    lines = cv2.HoughLinesP(
        
        image = edges,
        rho = 1,
        theta = np.pi / 180,
        threshold = 30,
        minLineLength = 350,
        maxLineGap = 20
    )
    
    return lines

In [None]:
# Get distance in-between points in pixels
# Switching kernel and cv methods would result in a faster, albeit less accurate, function
def get_pt_dist(driver, elem, pos, size, half_size, bgr_set, x_step=20, max_range=1500, max_x_var=5, max_y_var=5):

    assert x_step <= half_size[0]*.75 # Just as a safe guard for kernel implementation

    ref_coord = find_circles(get_plot_image(driver, pos, size), bgr_set)[0]

    x_dist = 0 # Total distance dragged so far; must be defined before its first use below

    circles = kernel_method(get_plot_image(driver, pos, size), bgr_set, [half_size[0], 25], [half_size[0]//2, 0])
    if circles is not None:
        x, y = circles[0]
        if abs(y - ref_coord[1]) <= max_y_var and x != (ref_coord[0] + x_dist):
            move_canvas(driver, elem, half_size, (ref_coord[0] - x, 0))
            return ref_coord[0] - x + x_dist, ref_coord

    while x_dist <= max_range:

        move_canvas(driver, elem, half_size, (x_step, 0))
        x_dist += x_step
        im = get_plot_image(driver, pos, size)
        circles = kernel_method(im, bgr_set, [x_step, 25], [x_step//2-half_size[0], 0])
        print(circles)
        cv2_imshow(im)
        if circles is not None:
            # Guaranteed to be a different circle, because of kernel implementation
            x, _ = circles[0]
            move_canvas(driver, elem, half_size, (ref_coord[0] - x, 0))
            return ref_coord[0] - x + x_dist, ref_coord

    return # Note: this early return leaves the cv fallback below unreachable

    print('Unable to find next dot using kernel implementation')
    print('Falling back using cv implementation')
    
    while x_dist >= 0:

        move_canvas(driver, elem, half_size, (-x_step, 0))
        x_dist -= x_step

        circles = cv_method(get_plot_image(driver, pos, size), bgr_set)

        if circles is not None:
            for x, y in circles:
                if abs(y - ref_coord[1]) <= max_y_var and abs(x - ref_coord[0] - x_dist) >= max_x_var:
                    move_canvas(driver, elem, half_size, (ref_coord[0] - x, 0)) # Actually move to the next dot
                    return ref_coord[0] - x + x_dist, ref_coord

    return None

In [None]:
# Gets the width of the plot in terms of number of points
# Please clear the plot output afterwards to minimize confusion
# between this cell and the actual plotting cell
def get_plot_width(driver, elem, pos, size, half_size, step_size, bgr_set):

    step = (step_size, 0)

    # Move to the first boundary
    lines = find_lines(get_plot_image(driver, pos, size))
    while lines is None:
        move_canvas(driver, elem, half_size, step)
        lines = find_lines(get_plot_image(driver, pos, size))

    lines = None
    reverse_step = np.negative(step)
    num_dots = 1
    while lines is None:
        move_canvas(driver, elem, half_size, reverse_step)
        num_dots += 1
        lines = find_lines(get_plot_image(driver, pos, size))

    return num_dots

In [None]:
def get_unloaded_sum(im, unloaded_color=(52, 115, 117)):
    return (im == unloaded_color).all(axis=-1).astype(int).sum()

In [None]:
# Returns cropped screenshot and its corresponding label
def crop_and_label(im, coord, bgr_set, box_size=None, do_kernel=False, kernel_size=(18, 18), kernel_gap=(6, 6)):
    y, x = coord[:2] # For safety

    for i in range(3):

        # Invoke find_circles functions
        if i == 2:
        
            y_circ, x_circ = find_circles(im, bgr_set)[0]
            color = im[y_circ, x_circ]

        # Sparse kernel to get potential dot locations at several points
        elif do_kernel:
            
            k_size_x, k_size_y = kernel_size
            lower_x, lower_y = k_size_x//2, k_size_y//2
            upper_x, upper_y = k_size_x - lower_x, k_size_y - lower_y

            pixels = im[y-lower_y:y+upper_y+1:kernel_gap[1], x-lower_x:x+upper_x+1:kernel_gap[0]]
            pixels = pixels.reshape(pixels.shape[0]*pixels.shape[1], 3)
            mask = [p for p in pixels if tuple(p) in bgr_set]
            if len(mask) > 0:
                color = mode(mask)[0][0]
            else:
                color = [None, None, None]

        # Get the color at the point location directly
        else:

            color = im[y, x]

        category = color_to_lc.get(tuple(reversed(color)), None) # If the inner circle color doesn't match any valid lc-color, assume it is unanswered
        if category is not None:
            break
        else:
            if i == 0:
                do_kernel = True
            if i == 2:
                return None, None
    
    if box_size is not None:
        y_lower = y-box_size[1] if y >= box_size[1] else 0
        x_lower = x-box_size[0] if x >= box_size[0] else 0
        # Arrays don't overflow in terms of slice indices' upper bounds
        # This is very bad programming, but it's less work for the same result.
        return im[y_lower:y+box_size[1], x_lower:x+box_size[0]], category
    else:
        return im, category
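
The cropping at the end of `crop_and_label` leans on the fact that slice upper bounds past the edge of an array are silently truncated. A sketch that clamps both ends explicitly is given here; `crop_bounds` is a hypothetical helper, and the 1920x1200 size simply mirrors the window size set earlier.

```python
def crop_bounds(x, y, box_size, width, height):
    """Return (x0, x1, y0, y1) for a box around (x, y), clamped to the image."""
    half_w, half_h = box_size
    x0, x1 = max(x - half_w, 0), min(x + half_w, width)
    y0, y1 = max(y - half_h, 0), min(y + half_h, height)
    return x0, x1, y0, y1


print(crop_bounds(100, 100, (30, 30), 1920, 1200))  # Fully inside the image
print(crop_bounds(10, 5, (30, 30), 1920, 1200))     # Clamped at the top-left
```

The crop itself would then be `im[y0:y1, x0:x1]`.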

In [None]:
elem = driver.find_element_by_xpath('//canvas')
pos = tuple(elem.location.values())
size = tuple(reversed(tuple(elem.size.values())))
half_size = tuple(map(lambda x: x//2, size))

This cell finds the distance between each dot. The expected output should be somewhere between 1060 and 1080. This cell might take some time; just let it run.

In [None]:
pt_dist, pt_loc = get_pt_dist(driver, elem, pos, size, half_size, bgr_set)
print(pt_dist)

This cell finds the width of the plot (in dots) using the distance between each point (found in the previous cell). This one might also take some time.

In [None]:
plt_width = get_plot_width(driver, elem, pos, size, half_size, pt_dist, bgr_set)
print(plt_width)

In [None]:
num_dots = plt_width**2
print(num_dots)

In [None]:
move_canvas(driver, elem, half_size, (0, pt_dist*(plt_width//2))) # Move to the top-right corner

Enter the number of plots you would like to scrape. Please scrape as many as you can (36 for most).

In [None]:
while True:

    num_plots = input('How many plots do you want to scrape.\nIf you want to scrape as many as possible, enter nothing: ')
    clear_output(True)
    
    if num_plots == '':
        num_plots = np.inf # I have mixed feelings about this
        break

    try:
        assert num_plots.isdigit()
        num_plots = int(num_plots)
        assert num_plots > 0
        break
    
    except AssertionError:
        print('Invalid input: input must be an integer greater than 0')

In [None]:
plt_id = re.search(r'"\d+"', id_elem.get_attribute('outerHTML')).group().strip('"')
start_id = plt_id
finished = 0
while True:
    # Snake traversal
    for i in range(num_dots):
        
        print(f'\rCurrently processing dot #{i+1}', end='', flush=True)
        # Wait until image is mostly loaded
        for j in range(8):
            plt_im = get_plot_image(driver, pos, size)
            
            if j == 0:
                print(f'\rStalling until dot {i+1} has loaded fully', end='', flush=True)

            if get_unloaded_sum(plt_im) <= 10 or j > 8:
                print(f'\rDot {i+1} has finished loading', end='', flush=True)
                break

        im, cat = crop_and_label(plt_im, pt_loc, bgr_set)
        if cat is None:
            print(f'\rPlot {plt_id} is unanalyzed')
            break

        im = cv2.resize(im, (60, 60), interpolation=cv2.INTER_AREA) # Down-scaling image

        im_name = f'{plt_id}_{i}'
        cv2.imwrite(f'{folder_path}/{cat}/{im_name}.png', im)

        if (i+1)%11 == 0: # A line length of 11 dots is hard-coded here
            step = (pt_dist, 0)
        else:
            step = (0, pt_dist*(-1)**(i//11 + 1))

        move_canvas(driver, elem, half_size, step)

        print(f'\rFinished processing dot #{i+1}', end='', flush=True)

    print(f'\rFinished analyzing plot {plt_id}')
    finished += 1

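    # Stop once the requested number of plots is done, or once navigation has wrapped back around
    # to the starting plot (finished > 1 keeps the very first plot from ending the loop immediately)
    if (finished > 1 and plt_id == start_id) or finished >= num_plots:
        print('Exiting')
        print(f'Plot Count: {finished}')
        print(f'Start Plot: {start_id}')
        print(f'End Plot: {plt_id}')
        break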

    clear_output(True)
    
    for _ in range(2):

        try:
            
            driver.find_element_by_xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "mx-1", " " ))]').click() # Click the button to go to the next plot

            WebDriverWait(driver, 10).until(
                ElementChange(driver, (By.XPATH, "(//h3[@class='ml-2'])[1]"))
            )
            break
        
        except (TimeoutException, NoAlertPresentException):

            while True:

                end = False
                x = input('Would you like to continue (yes or no)? ').lower()

                if x == 'yes':

                    print('Switching analysis mode')
                    WebDriverWait(driver, 2).until(
                        EC.element_to_be_clickable((By.CLASS_NAME, 'form-check'))
                    ).click()
                    start_id = None # Reset the starting id
                    break

                elif x == 'no':

                    print('Ending')
                    end = True # Very ham-fisted, but works (I think)
                    break

                else:

                    print('Invalid response: must be "yes" or "no"')

            if end:
                break

    id_elem = driver.find_element_by_id('plotId')
    plt_id = re.search(r'"\d+"', id_elem.get_attribute('outerHTML')).group().strip('"')
    print(f'Now processing plot {plt_id}')

    if start_id is None:
        start_id = plt_id

    # Zoom in to dot level
    for _ in range(4):
        zoom_in.click()
        time.sleep(0.1)

    move_canvas(driver, elem, half_size, (-pt_dist*(plt_width//2), pt_dist*(plt_width//2)), 50) # Get to the top-right corner
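
The "snake traversal" in the loop above moves along one line of dots, shifts sideways by one dot, and then comes back in the opposite direction. The sketch below reproduces the same step formula for an arbitrary grid width so the pattern can be checked without a browser. `snake_steps` and `visited_cells` are hypothetical helpers, and the steps are canvas drags, so their signs need not match on-screen directions.

```python
def snake_steps(width, dist):
    """Return the drag applied after each dot of a width x width grid."""
    steps = []
    for i in range(width * width):
        if (i + 1) % width == 0:
            steps.append((dist, 0))  # End of a line: shift sideways by one dot
        else:
            # Direction along the line alternates from one line to the next
            steps.append((0, dist * (-1) ** (i // width + 1)))
    return steps


def visited_cells(width):
    """Follow the steps (unit distance) and list every cell visited."""
    x = y = 0
    cells = [(x, y)]
    for dx, dy in snake_steps(width, 1)[:-1]:  # The last step leaves the grid
        x, y = x + dx, y + dy
        cells.append((x, y))
    return cells


print(snake_steps(3, 10)[:4])
print(len(set(visited_cells(11))))
```

Each cell is visited exactly once, which is why the number of iterations equals the number of dots.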

If the program disconnects, reload the page. The program should still be running. If it does not appear to be (the cell above shows no changing output), simply restart the program and resume at the current plot: for example, if the program stopped at plot 4528, enter 4528 in the initial plot selection cell.

In [None]:
driver.quit() # Close the browser and end the chromedriver session

After you are finished, share the folder of the plots with me (ananthmadan03@gmail.com).