## prep

Read the Terms of Service

https://regulationsgov.github.io/developers/terms/

Step 1: request an API key

https://regulationsgov.github.io/developers/key/

Assign the key, as a quoted string, to a variable called `API_KEY`.

In [1]:
API_KEY = 'your api key here'
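
Hardcoding the key makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key has been stored in an environment variable; the name `REGULATIONS_API_KEY` and the helper `load_api_key` are hypothetical and not something the API requires:

```python
import os

def load_api_key(var_name="REGULATIONS_API_KEY"):
    """Return the API key stored in the environment variable `var_name`.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key
```

With the variable set, `API_KEY = load_api_key()` replaces the hardcoded string.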

## search for documents

Regulations.gov publishes three separate APIs for retrieving data. [Full documentation is here](http://regulationsgov.github.io/developers/console/#!/documents.json/documents_get_0)

`documents.json`: search across all of regulations.gov for documents.

`document.json`: retrieve specific metadata about a single document using the FDMS Document ID.

`docket.json`: retrieve specific metadata about a single docket using the Docket ID.

We'll begin by preparing a request to the `documents.json` API. The minimum required payload is the `'api_key'` and the search term `'s'`.

In [2]:
big_search_url = 'https://api.data.gov/regulations/v3/documents.json'

search_payload = {
    'api_key': API_KEY,
    's':"benzene"
}
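
To see what this payload turns into, it can help to build the query string by hand. The sketch below uses only the standard library; `example_payload` and `example_url` are names introduced here, and `'YOUR_KEY'` is a placeholder rather than a working key:

```python
from urllib.parse import urlencode

search_url = 'https://api.data.gov/regulations/v3/documents.json'

# Placeholder key for illustration only; substitute your own.
example_payload = {'api_key': 'YOUR_KEY', 's': 'benzene'}

# Passing params= to requests.get appends the payload as a query string,
# roughly equivalent to this concatenation.
example_url = search_url + '?' + urlencode(example_payload)
print(example_url)
```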

The json() portion of the response is a dictionary with two keys.

In [6]:
import requests
response = requests.get(big_search_url, params=search_payload)
documents = response.json()
documents.keys()

dict_keys(['documents', 'totalNumRecords'])

In [7]:
total_records = documents['totalNumRecords']
total_records

31062

*This is the total number of documents on regulations.gov that match the search term, not the number returned in this response.*

You can confirm this number at the following link, which runs the same search for `'benzene'` on the website.

https://www.regulations.gov/searchResults?rpp=25&po=0&s=benzene

The `'totalNumRecords'` is different from the number of documents we got in this response because the API will only return 25 items at a time by default.

In [8]:
these_documents = documents['documents']
type(these_documents)

list

In [9]:
len(these_documents)

25
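
Since each response carries at most 25 documents by default, the number of requests needed to cover a result set follows from `'totalNumRecords'`. A small sketch of that arithmetic; `requests_needed` is a helper name introduced here for illustration:

```python
import math

def requests_needed(total_records, page_size=25):
    """Number of requests needed to cover total_records at page_size per request."""
    return math.ceil(total_records / page_size)

print(requests_needed(31062))  # 1243
```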

Here's the first entry. Pay attention to the `'documentId'` field. We'll use it later.

In [10]:
these_documents[0]

{'agencyAcronym': 'EPA',
 'allowLateComment': False,
 'attachmentCount': 0,
 'commentDueDate': None,
 'commentStartDate': None,
 'docketId': 'EPA-HQ-SFUND-1989-0005',
 'docketTitle': 'Reportable Quantity Adjustments Final Rule for Carcinogens; Proposed Rule 54 FR 33418 (8/14/89)',
 'docketType': 'Rulemaking',
 'documentId': 'EPA-HQ-SFUND-1989-0005-0314',
 'documentStatus': 'Posted',
 'documentType': 'Supporting & Related Material',
 'numberOfCommentsReceived': 0,
 'openForComment': False,
 'postedDate': '2017-04-05T00:00:00-04:00',
 'rin': 'Not Assigned',
 'summary': "Document Contents : ...y 1QR3 Agency ' svEPA Research and %%-iS-^ Development REPORTABLE QUANTITY DOCUMENT FOR <endeca_term>BENZENE</endeca_term>. PENTACHLORONITRO- Prepared for OFFICE OF SOLID WASTE AND EMERGENCY RESPONSE Prepared by...",
 'title': 'Research and Development Reportable Quality Document for Benzene, Pentachloronitrobenzene (82-68-8]  [102RQ-273C-3-1'}
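
Each entry carries more fields than are needed for a quick look, and the `'summary'` contains `<endeca_term>` tags that appear to mark the matched search term (an assumption based on the sample above). A minimal sketch for trimming an entry; `clean_summary` and `brief` are hypothetical helper names:

```python
import re

def clean_summary(summary):
    """Remove the <endeca_term> highlight tags from a summary string."""
    return re.sub(r'</?endeca_term>', '', summary)

def brief(document):
    """Keep a few fields from one search result for quick inspection."""
    return {
        'documentId': document['documentId'],
        'docketId': document['docketId'],
        'documentType': document['documentType'],
        # 'summary' may be missing or None, so fall back to an empty string
        'summary': clean_summary(document.get('summary') or ''),
    }
```

For example, `[brief(d) for d in these_documents]` gives a compact view of the whole page.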

## retrieve all records

Let's reuse the above search but retrieve only finalized rules.

To do this we need to do two things:
1. Set the document type parameter: `'dct': 'FR'`.
2. Iteratively update the `'po'` (page offset) parameter, starting at zero, until all documents are found.

In [11]:
search_payload['dct'] = 'FR'
response = requests.get(big_search_url, params=search_payload)
total_records = response.json()['totalNumRecords']
total_records

469

In [13]:
# collect all list-of-documents here
found_documents = []

#initialize the page number
search_payload['po'] = 0

while len(found_documents) < total_records:
    response = requests.get(big_search_url, params=search_payload)
    
    # docs is a list
    docs = response.json()['documents']
    # stitch all of the items on the larger list
    found_documents.extend(docs)
    
    # increment the page number
    search_payload['po'] += 1

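We actually ended up with more records than `total_records`. The loop made 19 requests and each one returned a full page of 25 documents (19 × 25 = 475). A short final page would have been expected otherwise, which suggests that `'po'` is an offset counted in documents rather than a page index, so incrementing it by 1 makes consecutive pages overlap and leaves duplicates in `found_documents`.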

In [14]:
len(found_documents)

475
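
A sketch of a loop that avoids overlapping pages by advancing the offset by the number of documents actually received. It assumes `'po'` is a document offset, as the count above suggests. `collect_all` and `fetch_page` are names introduced here, and the request itself is passed in as a function so the paging logic can be checked without network access:

```python
def collect_all(fetch_page):
    """Collect every document by calling fetch_page(offset) until done.

    fetch_page must return a dict shaped like the documents.json response,
    with 'totalNumRecords' and 'documents' keys.
    """
    found = []
    offset = 0
    total = None
    while total is None or offset < total:
        page = fetch_page(offset)
        total = page['totalNumRecords']
        docs = page.get('documents', [])
        if not docs:
            # stop rather than loop forever if the server returns nothing
            break
        found.extend(docs)
        # advance past the documents just received
        offset += len(docs)
    return found

# Possible wiring to the search above (not run here):
# def fetch_page(offset):
#     payload = dict(search_payload, po=offset)
#     return requests.get(big_search_url, params=payload).json()
# found_documents = collect_all(fetch_page)
```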

## retrieve more specific information

For each document let's try to find the link to the PDF. To do this we'll need to make a separate request to the `document.json` API using the `documentId` we found earlier.

The `document.json` API returns a different set of fields than the search above. The full list of fields is here:

http://regulationsgov.github.io/developers/fields/

In [15]:
doc_url = 'https://api.data.gov/regulations/v3/document.json'
doc_payload = {'api_key':API_KEY}

We need to iterate through all of the documents we found above, place the `'documentId'` value in the `document.json` payload and grab all values of `'fileFormats'`.

**Note**: trial and error showed that some documents list their files under `'attachments'` while others only have a top-level `'fileFormats'`. A more restrictive search would likely return a single document type and avoid having to check both.

In [16]:
# save the links to the PDF files
pdf_links = []

for doc in found_documents:
    doc_payload['documentId'] = doc['documentId']
    doc_response = requests.get(doc_url, params=doc_payload)
    # use a separate name so the search result in `doc` is not overwritten
    details = doc_response.json()

    # TODO: also store the expected filename of each PDF,
    # perhaps as a list of lists or tuples.

    # files attached directly to the document
    if 'fileFormats' in details:
        pdf_links.extend(details['fileFormats'])
    # files listed under each attachment
    if 'attachments' in details:
        for attachment in details['attachments']:
            if 'fileFormats' in attachment:
                pdf_links.extend(attachment['fileFormats'])
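
The comment in the cell above suggests keeping the filename alongside each link. One way to sketch that is to pair every URL with the `documentId` it came from. `file_links` is a hypothetical helper, and the response shape it expects is only what the cell above relies on (`'fileFormats'` at the top level and inside each attachment):

```python
def file_links(document_id, details):
    """Return (document_id, url) pairs for every file in a document.json response."""
    # files listed directly on the document
    links = [(document_id, url) for url in details.get('fileFormats', [])]
    # files listed under each attachment
    for attachment in details.get('attachments', []):
        links.extend((document_id, url) for url in attachment.get('fileFormats', []))
    return links
```

Inside the loop this would be called as `pdf_links.extend(file_links(doc['documentId'], details))`, using the decoded `document.json` response for `details`.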

In [17]:
len(pdf_links)

804

Let's just save the first one to disk.

In [18]:
first_pdf = pdf_links[0]
first_pdf

'https://api.data.gov/regulations/v3/download?documentId=EPA-HQ-OPP-2015-0376-0003&contentType=pdf'

This trick extracts the `documentId` from the PDF URL so it can be used as the filename.

In [19]:
from urllib.parse import urlparse
docId = urlparse(first_pdf).query.split('&')[0].split('=')[1]
docId

'EPA-HQ-OPP-2015-0376-0003'
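
Splitting on `'&'` and `'='` depends on `documentId` being the first query parameter. A slightly more robust sketch uses `parse_qs` from the standard library, which looks the parameter up by name; `document_id_from_url` is a helper name introduced here:

```python
from urllib.parse import urlparse, parse_qs

def document_id_from_url(url):
    """Return the documentId query parameter of a download URL."""
    query = parse_qs(urlparse(url).query)
    # parse_qs maps each parameter name to a list of values
    return query['documentId'][0]

url = ('https://api.data.gov/regulations/v3/download'
       '?documentId=EPA-HQ-OPP-2015-0376-0003&contentType=pdf')
print(document_id_from_url(url))  # EPA-HQ-OPP-2015-0376-0003
```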

To save all of them, run the following in a loop over `pdf_links`.

**Note**: the download URL does not include the key, so `API_KEY` must be passed as a parameter.

In [20]:
# download first, so a failed request does not leave an empty file behind
pdf = requests.get(first_pdf, params={'api_key':API_KEY})
pdf.raise_for_status()
with open(docId+'.pdf', 'wb') as f:
    f.write(pdf.content)
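
A sketch of the loop over all links. It assumes every URL carries `documentId` and `contentType` query parameters like the first one does, and it uses the `contentType` value as the file extension, which is also an assumption. `target_filename` and `save_all` are hypothetical names, and the download step is passed in as a function so the file handling can be tried without network access:

```python
import os
from urllib.parse import urlparse, parse_qs

def target_filename(url):
    """Build '<documentId>.<contentType>' from a download URL."""
    query = parse_qs(urlparse(url).query)
    return query['documentId'][0] + '.' + query['contentType'][0]

def save_all(links, download, directory='.'):
    """Save every link to `directory`; download(url) must return bytes."""
    saved = []
    for index, url in enumerate(links):
        # prefix with the position so attachments of one document don't collide
        name = f'{index:04d}_{target_filename(url)}'
        path = os.path.join(directory, name)
        content = download(url)
        with open(path, 'wb') as f:
            f.write(content)
        saved.append(path)
    return saved

# Possible wiring to the cells above (not run here):
# save_all(pdf_links,
#          lambda url: requests.get(url, params={'api_key': API_KEY}).content)
```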