# Running the NextRequest scraper
*Author: Steven Yuan*

This Jupyter notebook runs the scraper against NextRequest websites. **Use it only for scraping; EDA goes in the `nextrequest_eda` notebook.** The scraper currently supports only Firefox via `geckodriver`; if you need support for other browsers, please let me know. See the `nextrequest_scraper_dev` notebook for an in-depth explanation of how the scraper works.

## TODO
- Find a way to handle `StaleElementReferenceException` without getting into an infinite loop - maybe try a global variable?
- Bug: CSS selector for the messages is not working
- Bug: Scraper cannot find any elements in Albuquerque NextRequest database despite the driver being able to access those elements outside of scraper code
- Write better documentation for `nextrequest_scraper`
- Improve scraper log formatting
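
For the first TODO item, one possible direction is a retry with a hard cap on attempts, which needs no global variable. The following is only a sketch, not how `nextrequest_scraper` currently works: `retry_bounded` and its parameters are hypothetical names introduced here for illustration.

```python
import time


def retry_bounded(action, exceptions, max_attempts=3, delay=0.0):
    """Call action() until it succeeds or max_attempts is reached.

    The last exception is re-raised once the attempts are used up,
    so the caller can never end up in an infinite loop.
    (Hypothetical helper, not part of nextrequest_scraper.)
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == max_attempts:
                raise  # Give up and surface the error to the caller
            time.sleep(delay)  # Brief pause before trying again
```

With Selenium, `exceptions` could be `StaleElementReferenceException` from `selenium.common.exceptions`, and `action` a small function that locates the element again and reads its text.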

In [15]:
# !pip install selenium
from selenium import webdriver
import nextrequest_scraper
from nextrequest_scraper import NextRequestScraper

# Scraper options
options = webdriver.FirefoxOptions()
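# Run Firefox without a visible window. The `options.headless` setter was
# deprecated and later removed in Selenium 4; the argument form works in both.
options.add_argument('-headless')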

In [16]:
import importlib
importlib.reload(nextrequest_scraper)

<module 'nextrequest_scraper' from '/home/powerofapoint/notebooks/police-records-analysis/steven/scraper/nextrequest_scraper.py'>

Set your desired parameters and options here before running the scraper:

In [17]:
requests = []  # List of dictionaries containing scraped info on each request

# Scraper parameters
url = 'https://nextrequest.cabq.gov/requests/'  # URL to scrape from - make sure it ends with a forward slash!
earliest_id = '15-22'  # Earliest ID in database, or ID to start scraping from
requests_name = 'cabq_requests'  # Name of CSV file and ZIP archive to export scraped data to
path = '../data/'  # Directory to export the ZIP archive to - make sure it ends with a forward slash!
log = -1  # Log file to write to, if desired. Set to a blank string to only log to console, -1 to automatically generate a log file

num_requests = -1  # Number of requests to scrape. Set to a negative value to scrape all requests
wait_time = 0.1  # Implicit wait time, i.e. how long WebDriver keeps looking for a given element
timeout = 5  # Wait time between scraper runs in case of timeouts
progress = 100  # Show progress after every N scraped requests
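
Both `url` and `path` need a trailing `/`, which is easy to forget. As a minimal sketch, a small helper can enforce this instead of relying on the comments; `ensure_trailing_slash` is a hypothetical name and is not part of `nextrequest_scraper`.

```python
def ensure_trailing_slash(value):
    """Return value unchanged if it already ends with '/', else append one."""
    return value if value.endswith('/') else value + '/'


# Example: both values end up with exactly one trailing slash
url = ensure_trailing_slash('https://nextrequest.cabq.gov/requests')
path = ensure_trailing_slash('../data')
```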

When you are sure that your parameters are set correctly, run the following cell:

In [18]:
# Instantiate the driver
driver = webdriver.Firefox(options=options)

# Scrape data!
scraper = NextRequestScraper(driver, url, wait_time=wait_time)
num_requests -= scraper.scrape(requests, earliest_id, requests_name=requests_name, path=path,
                               num_requests=num_requests, timeout=timeout, progress=progress, log=log)

# Shut down the browser; ignore errors if the session has already ended
try:
    scraper.driver.quit()
except Exception:
    pass

Start time: 2022-04-27 23:41:19.722513

Iteration 1
-----------
Starting request: 15-22

Webdriver could not find element while scraping count 1

Traceback (most recent call last):
  File "/home/powerofapoint/notebooks/police-records-analysis/steven/scraper/nextrequest_scraper.py", line 168, in scrape_request
    request_id = self.driver.find_element(By.CLASS_NAME, 'request-title-text').text.split()[1][1:]  # Request ID
  File "/anaconda/envs/py38_default/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 1248, in find_element
    return self.execute(Command.FIND_ELEMENT, {
  File "/anaconda/envs/py38_default/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 425, in execute
    self.error_handler.check_response(response)
  File "/anaconda/envs/py38_default/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 247, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSu

In [2]:
num_requests

NameError: name 'num_requests' is not defined

In [28]:
len(requests)

3726

Note that the scraper may stop before it has reached all requests. If this happens, rerun the scraper cell (a fully automated retry is still in progress). **Do not rerun the parameters cell if the scraper ends prematurely: doing so resets the `requests` list, and you will lose all progress.**
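
One way to make an accidental rerun harmless is to create `requests` only when it does not already exist. This is a sketch under the assumption that `requests` lives in the notebook's global namespace; `get_or_create_list` is a hypothetical helper, not something the scraper provides.

```python
def get_or_create_list(namespace, name='requests'):
    """Return the list stored under `name`, creating an empty one if absent.

    An existing list is returned as-is, so results scraped so far survive
    a rerun of the cell that calls this function.
    """
    existing = namespace.get(name)
    if isinstance(existing, list):
        return existing
    namespace[name] = []
    return namespace[name]


# In the parameters cell, this could stand in for `requests = []`
requests = get_or_create_list(globals())
```

To start over on purpose, delete the variable first with `del requests`.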