# Running the NextRequest scraper
*Author: Steven Yuan*

This Jupyter notebook scrapes data from NextRequest websites. **Use it only for scraping; EDA goes in the `nextrequest-eda` notebook.** The scraper currently supports only Firefox via `geckodriver`; if you need support for other browsers, please let me know. See the `nextrequest-selenium` notebook for an in-depth explanation of how the scraper works.

In [17]:
!pip install selenium
from selenium import webdriver
from nextrequest_scraper import NextRequestScraper

# Scraper options
options = webdriver.FirefoxOptions()
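# Run Firefox without a visible window. Recent Selenium 4 releases removed this
# attribute; there, use options.add_argument('-headless') instead.
options.headless = True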



Set your desired parameters and options here before running the scraper:

In [18]:
requests = [] # List of dictionaries containing scraped info on each request

# Scraper parameters
url = 'https://vallejo.nextrequest.com/requests/' # URL to scrape from - make sure it ends with a forward slash!
earliest_id = '16-1' # Earliest request ID in the database
requests_name = 'test_requests_2' # Name of CSV file and ZIP archive to export scraped data to

num_requests = 100 # Number of requests to scrape
wait_time = 0.1 # Implicit wait time in seconds, i.e. how long WebDriver keeps looking for a given element
timeout = 10 # Wait time in seconds between scraper runs in case of timeouts
progress = 2 # Show progress every N requests that are scraped
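
The `url` parameter needs to end with a `/`. As a small safeguard, the URL could be normalized before it is passed to the scraper. The helper below is a sketch only: `ensure_trailing_slash` is a hypothetical name and is not part of `nextrequest_scraper`.

```python
def ensure_trailing_slash(url: str) -> str:
    """Return url with exactly one trailing forward slash."""
    # Hypothetical helper for illustration; not part of nextrequest_scraper
    return url.rstrip('/') + '/'


normalized = ensure_trailing_slash('https://vallejo.nextrequest.com/requests')
print(normalized)
```

Calling it on a URL that already ends with a slash leaves the URL unchanged, so it is safe to apply unconditionally.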

When you are sure that your parameters are set correctly, run the following cell:

In [19]:
# Instantiate the driver
driver = webdriver.Firefox(options=options)

# Scrape data!
scraper = NextRequestScraper(driver, url, wait_time=wait_time)

try:
    num_requests -= scraper.scrape(requests, earliest_id, requests_name=requests_name, 
                                   num_requests=num_requests, timeout=timeout, progress=progress)
except Exception:
    # Ignore scraper errors here; anything scraped so far stays in `requests`
    pass
finally:
    # Close the driver
    scraper.driver.close()

Iteration 1
-----------
Starting request: 16-1

Requests scraped: 2 	Avg runtime: 1.61s/request 	Total runtime: 3.2s
Exception occurred at count 3

Total requests scraped: 3 	Avg runtime: 1.56s/request 	Total runtime: 4.7s

Last request scraped: 16-15



Traceback (most recent call last):
  File "/home/powerofapoint/notebooks/police-records-analysis/steven/nextrequest_scraper.py", line 190, in scrape_request
  File "/anaconda/envs/py38_default/lib/python3.8/site-packages/selenium/webdriver/remote/webelement.py", line 77, in text
    return self._execute(Command.GET_ELEMENT_TEXT)['value']
  File "/anaconda/envs/py38_default/lib/python3.8/site-packages/selenium/webdriver/remote/webelement.py", line 710, in _execute
    return self._parent.execute(command, params)
  File "/anaconda/envs/py38_default/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 423, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/anaconda/envs/py38_default/lib/python3.8/site-packages/selenium/webdriver/remote/remote_connection.py", line 333, in execute
    return self._request(command_info[0], url, body=data)
  File "/anaconda/envs/py38_default/lib/python3.8/site-packages/selenium/webdriver/remote/remote

MaxRetryError: HTTPConnectionPool(host='localhost', port=56259): Max retries exceeded with url: /session/e5d09fe2-7976-4953-bb13-505be9b543d0/window (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9bd5aea250>: Failed to establish a new connection: [Errno 111] Connection refused'))

In [21]:
print(requests)

[{'id': '16-1', 'status': 'CLOSED', 'desc': 'I need my birth record and if possible open my birth mothers records and any other cps I was in as a child ', 'date': 'April 5, 2016 via web', 'depts': 'City Clerk’s Office', 'docs': None, 'poc': 'Dawn Abrahamson, City Clerk', 'msgs': 'title,item,time\n"Request Closed   Hide\nPublic","No Records: Other Agency\nWe do not have the records you requested. We suggest you submit a public records request to Solano County Assessor Recorder Vital Records Section.","April 5, 2016, 3:03pm by Dawn Abrahamson, City Clerk"\n"Request Published\nPublic",,"April 5, 2016, 2:58pm by Dawn Abrahamson, City Clerk"\n"Department Assignment\nPublic",City Clerk’s Office,"April 5, 2016, 1:24pm"\n"Request Opened\nPublic",Request received via web,"April 5, 2016, 1:24pm"\n'}, {'id': '16-3', 'status': 'CLOSED', 'desc': 'Address:\n408 Tennessee Street, Vallejo, CA 94590\nLooking for the following types of records for the Fire and Building departments, if avaialable:\n-HAZM
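
As a quick sanity check on the scraped structure, note that the `msgs` value in the output above is itself CSV text with the header `title,item,time`. The sketch below parses one such value with the standard library; `parse_msgs` is a hypothetical helper, and the sample string is a shortened copy of the first request's `msgs` field.

```python
import csv
import io


def parse_msgs(msgs: str) -> list:
    """Parse the CSV text stored under a request's 'msgs' key into a list of dicts."""
    # Hypothetical helper for illustration; not part of nextrequest_scraper
    return list(csv.DictReader(io.StringIO(msgs)))


# Shortened sample in the same shape as the 'msgs' value printed above
sample = (
    'title,item,time\n'
    '"Request Opened\nPublic",Request received via web,"April 5, 2016, 1:24pm"\n'
)

for row in parse_msgs(sample):
    # The title field spans two lines; show only the first
    print(row['title'].splitlines()[0], '|', row['time'])
```

Quoted fields may contain newlines and commas, which is why a CSV parser is used here rather than `str.split`.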

Note that the scraper may stop before it has scraped all of the requested records. If this happens, rerun the scraper cell above (a fully automated retry is still in progress). **Do not rerun the parameters cell if the scraper ends prematurely: doing so resets the `requests` list to empty, and all progress is lost.**
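
Until automated retries exist, the manual rerun could be approximated with a small loop. This is a sketch under stated assumptions, not the scraper's own mechanism: `scrape_until_done` and `scrape_once` are hypothetical names, and `scrape_once` is assumed to return the number of requests it scraped in that run, mirroring how `num_requests -= scraper.scrape(...)` is used in the scraper cell.

```python
import time


def scrape_until_done(scrape_once, num_requests, timeout=10, max_runs=20):
    """Call scrape_once(remaining) until num_requests are scraped or max_runs is hit.

    scrape_once is a hypothetical callable that returns how many requests
    it scraped in that run. Returns the number of requests still remaining.
    """
    remaining = num_requests
    runs = 0
    while remaining > 0 and runs < max_runs:
        runs += 1
        try:
            scraped = scrape_once(remaining)
        except Exception as exc:
            # Treat a failed run as zero progress and try again
            print(f'Run {runs} failed: {exc!r}')
            scraped = 0
        remaining -= scraped
        if remaining > 0:
            # Pause between runs, e.g. to let a timed-out site recover
            time.sleep(timeout)
    return remaining
```

In this notebook, `scrape_once` might wrap the body of the scraper cell: start a fresh driver, call `scraper.scrape` with the shared `requests` list, and close the driver. That only works if `scrape` resumes from the entries already in `requests`, which the rerun advice suggests but which should be confirmed in `nextrequest_scraper.py`. The `max_runs` cap keeps the loop from spinning forever if a run never makes progress.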