# Running the NextRequest scraper
*Author: Steven Yuan*

This Jupyter notebook scrapes data from NextRequest websites. **Use it only for scraping; EDA goes in the `nextrequest-eda` notebook.** The scraper currently supports only Firefox via `geckodriver`; if you need support for other browsers, please let me know. See the `nextrequest-selenium` notebook for an in-depth explanation of how the scraper works.

In [None]:
!pip install selenium
from selenium import webdriver
from nextrequest_scraper import NextRequestScraper

# Scraper options
options = webdriver.FirefoxOptions()
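options.add_argument('-headless') # Run Firefox without a visible window; the `headless` attribute is deprecated in recent Selenium 4 releases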

Set your desired parameters and options here before running the scraper:

In [6]:
requests = [] # List of dictionaries containing scraped info on each request

# Scraper parameters
url = 'https://vallejo.nextrequest.com/requests/' # URL to scrape from - ensure that it ends with a trailing slash (/)!
earliest_id = '16-1' # Earliest request ID in the database
requests_name = 'vallejo_requests' # Name of CSV file and ZIP archive to export scraped data to

num_requests = -1 # Number of requests to scrape
wait_time = 0.1 # Implicit wait time, i.e. how long WebDriver keeps looking for a given element
timeout = 10 # Wait time between scraper runs in case of timeouts
progress = 5 # Show progress every N requests that are scraped
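
The trailing-slash requirement and the request ID format can be checked before starting a long run. The sketch below is a hypothetical pre-flight check and not part of `nextrequest_scraper`: the `check_params` name is made up here, and the digits-hyphen-digits ID pattern is an assumption based only on the IDs that appear in this notebook (`16-1`, `20-283`).

```python
import re

def check_params(url, earliest_id):
    """Return (url, earliest_id) after basic sanity checks (hypothetical helper)."""
    # Make sure the URL ends with exactly one forward slash
    url = url.rstrip('/') + '/'
    # Assumed ID format: digits, a hyphen, digits (e.g. '16-1')
    if not re.fullmatch(r'\d+-\d+', earliest_id):
        raise ValueError(f'Unexpected request ID format: {earliest_id!r}')
    return url, earliest_id

url, earliest_id = check_params('https://vallejo.nextrequest.com/requests', '16-1')
print(url)  # https://vallejo.nextrequest.com/requests/
```

If other NextRequest sites use a different ID scheme, the pattern would need to be loosened or dropped.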

When you are sure that your parameters are set correctly, run the following cell:

In [8]:
# Instantiate the driver
driver = webdriver.Firefox(options=options)

# Scrape data!
scraper = NextRequestScraper(driver, url, wait_time=wait_time)
scraper.scrape(requests, earliest_id, requests_name=requests_name, 
               num_requests=num_requests, timeout=timeout, progress=progress)

# Quit the driver (ends the browser session, not just the current window)
scraper.driver.quit()

Iteration 1
-----------
Starting request: 20-283

Requests scraped: 5 	Avg runtime: 2.03s/request 	Total runtime: 10.1s
Requests scraped: 10 	Avg runtime: 2.18s/request 	Total runtime: 21.8s
Requests scraped: 15 	Avg runtime: 2.15s/request 	Total runtime: 32.2s
Requests scraped: 20 	Avg runtime: 2.14s/request 	Total runtime: 42.8s
Requests scraped: 25 	Avg runtime: 2.17s/request 	Total runtime: 54.3s
Requests scraped: 30 	Avg runtime: 2.26s/request 	Total runtime: 67.7s
Requests scraped: 35 	Avg runtime: 2.26s/request 	Total runtime: 79.1s
Requests scraped: 40 	Avg runtime: 2.31s/request 	Total runtime: 92.5s
Requests scraped: 45 	Avg runtime: 2.32s/request 	Total runtime: 104.3s
Requests scraped: 50 	Avg runtime: 2.33s/request 	Total runtime: 116.3s
Requests scraped: 55 	Avg runtime: 2.36s/request 	Total runtime: 129.8s
Requests scraped: 60 	Avg runtime: 2.34s/request 	Total runtime: 140.3s
Requests scraped: 65 	Avg runtime: 2.32s/request 	Total runtime: 151.0s
Requests scraped: 70 	A
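
The run cell releases the driver only after `scrape` returns, so an exception during scraping would leave the browser session open. A minimal sketch of one way to guarantee cleanup is shown below; `run_with_cleanup` is a hypothetical helper, not part of `nextrequest_scraper`, and it assumes only that the driver object has a `quit()` method, as Selenium WebDriver instances do.

```python
def run_with_cleanup(driver, task):
    """Run task(driver) and always call driver.quit() afterwards (hypothetical helper)."""
    try:
        return task(driver)
    finally:
        # Runs whether task() returned normally or raised
        driver.quit()

# Possible use in the notebook, with the names defined in the cells above:
# run_with_cleanup(
#     driver,
#     lambda d: NextRequestScraper(d, url, wait_time=wait_time).scrape(
#         requests, earliest_id, requests_name=requests_name,
#         num_requests=num_requests, timeout=timeout, progress=progress))
```

Because `requests` is filled in place, anything scraped before the error stays in the list.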

Note that the scraper may stop before it has reached every request. If this happens, rerun the scraping cell (I am currently working on a way to make the scraper fully automated). **Do not rerun the parameters cell if the scraper ends prematurely in this way: doing so resets the `requests` list, and you will lose all progress.**
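
One way to limit the damage from an accidental reset is to guard the list and write a backup between runs. The sketch below is a hypothetical addition and not part of `nextrequest_scraper`: `save_checkpoint` is a made-up name, and the code assumes only that `requests` is a list of dictionaries, as described in the parameters cell.

```python
import csv

# Create the list only if it does not exist yet, so re-running this cell keeps earlier results
try:
    requests
except NameError:
    requests = []

def save_checkpoint(rows, path):
    """Write a list of dicts to a CSV file and return the number of rows written."""
    if not rows:
        return 0
    # Collect column names across all rows, preserving first-seen order
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)  # Keys missing from a row are written as empty strings
    return len(rows)
```

For example, `save_checkpoint(requests, requests_name + '_backup.csv')` could be run after each interrupted scrape; the `_backup` suffix is only an illustration.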