# Gathering Final Rule Text from Regulations.gov API

* * * * *

In [2]:
# Import required libraries
# (a __future__ import has to come before any other import)
from __future__ import division
import requests
from urllib import quote_plus
import json
import math
import csv

## 1: API Keys

Get an API key from the [Regulations.gov](http://regulationsgov.github.io/developers/key/) GitHub site and set it in the variable below.

Note: The default rate limit of 1,000 requests per hour applies to all Regulations.gov API users.
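To stay under that limit when looping over many FR references, requests can be spaced out. The sketch below is only an illustration: `min_interval_seconds` and `Throttle` are hypothetical helpers that are not part of this notebook, and they assume the limit is spread evenly over the hour (1,000 per hour works out to one request every 3.6 seconds).

```python
import time


def min_interval_seconds(requests_per_hour=1000):
    """Average gap between calls that keeps us within the hourly limit."""
    return 3600.0 / requests_per_hour


class Throttle(object):
    """Hypothetical helper: keeps calls at least `interval` seconds apart."""

    def __init__(self, interval, clock=time.time, sleep=time.sleep):
        self.interval = interval
        self.clock = clock      # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.last_call = None   # time of the previous call, None before the first one

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now
```

Calling `throttle.wait()` immediately before each `requests.get(...)` would be one way to use it.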

In [3]:
# set key
key = "MEbdOvsBUDfzxpeR4Dxne1iIzy1WwW0g8xhufQKE"

## 2. Requesting Data

### 2.1 Defining the `get_docs_from_api` function

The CFTC comments we have are tied to Federal Register (FR) references. Those references point to proposed rules and use a different set of identifiers than the regulations.gov website. To get the final rules, which might cite the public comments, we need to turn CFTC's proposed-rule FR numbers into Regulations.gov `documentId` values. We do this with the regulations.gov API: we pass it a CFTC FR reference as a keyword search and filter the results to CFTC documents that are final rules.
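Because the FR reference is sent as a free-text search term, it can help to confirm that a string really looks like one before spending a request on it. A minimal sketch, assuming references always take the `<volume> FR <page>` form used in this notebook (for example `75 FR 63732`); `parse_fr_reference` is a hypothetical helper, not part of the original code.

```python
import re

# volume number, the literal "FR", then the starting page number
FR_REFERENCE = re.compile(r"^\s*(\d+)\s+FR\s+(\d+)\s*$")


def parse_fr_reference(ref):
    """Return (volume, page) for a reference such as '75 FR 63732'."""
    match = FR_REFERENCE.match(ref)
    if match is None:
        raise ValueError("not a Federal Register reference: %r" % (ref,))
    return int(match.group(1)), int(match.group(2))
```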

**Note**: I have written the function so that one could, in theory, ask for a different document type. In this project, only `"FR"` will ever be sent to the function, but any of the document types listed below could be passed instead.

**Note**: This is poorly specified in the documentation for Regulations.gov's API, but the document types that `documentType` accepts are a fixed list of values:

* N: Notice
* PR: Proposed Rule
* FR: Rule
* O: Other
* SR: Supporting & Related Material
* PS: Public Submission (NB: this is where Public Comments would live, but we do not need this for our project)

Since we want the final rule text in order to get citations, we will set up `get_docs_from_api` to take a parameter that specifies one of these document types and then pass it `"FR"`.
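Since the accepted codes are a fixed list, a small guard can catch a mistyped code before any request is made. This is only a sketch: `DOCUMENT_TYPES` copies the list above, and `check_document_type` is a hypothetical helper that the notebook's own function does not call.

```python
# Codes and labels copied from the list of document types above
DOCUMENT_TYPES = {
    "N": "Notice",
    "PR": "Proposed Rule",
    "FR": "Rule",
    "O": "Other",
    "SR": "Supporting & Related Material",
    "PS": "Public Submission",
}


def check_document_type(code):
    """Return the code unchanged if it is one of the known document types."""
    if code not in DOCUMENT_TYPES:
        raise ValueError("unknown document type %r, expected one of %s"
                         % (code, sorted(DOCUMENT_TYPES)))
    return code
```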

In [29]:
def get_docs_from_api(comment_FR_ref, documentType):
    # set base url
    base_url="https://api.data.gov/regulations/v3/documents"

    # set response format
    response_format=".json"

    # set search parameters
    search_params = {"s":comment_FR_ref,
                     "api_key":key,
                     "a":"CFTC",
                     "dct":documentType
                    }

    # make request
    r = requests.get(base_url+response_format, params=search_params)
    
    # convert to a dictionary
    data=json.loads(r.text)
    
    # get number of "hits" (doc records returned by the search for the FR ref) 
    hits = data['totalNumRecords']
    print "There are " + str(hits) + " documents returned by \""+comment_FR_ref+"\""
    
    # make an empty list where we'll hold the document IDs for each of the docs returned by the search
    docIDs = [] 
    
    # get just the doc records from the documents API return, not the totalNumRecords object
    docRecords = data['documents']
    
    # pull out the docIDs for each doc
    for docRecord in docRecords:
        docIDs.append(docRecord['documentId'].encode("utf8"))
    
    # Make a list to hold one dictionary per rule (full text plus meta-data)
    docFullTexts = []
    
    # now we're ready to loop through the document IDs and fetch each document's text from the download endpoint
    for docID in docIDs:
        # using the URL pattern we recognized from the document (non plural) API, we'll just construct the download URL manually.  This almost seems like a hidden "download" API.  Haxors!
        ##  Note that we request the HTML version, take the response text, and encode it as a utf8 string
        fullText = requests.get("https://api.data.gov/regulations/v3/download?api_key="+key+"&documentId="+docID+"&contentType=html").text.encode("utf8")
        
        # let's create a dictionary to hold the full text and meta-data like the document ID
        document = {}
        
        # Now we'll store the document ID, the full text, and the FR reference we searched for
        document['documentId'] = docID
        document['Full Text'] = fullText
        document['Comment FR Reference'] = comment_FR_ref
        
        docFullTexts.append(document)
        
        
    # and we end by returning the list of documents stored as a dictionary for each document containing the full text and meta-data    
    return(docFullTexts)
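The function above builds the download URL by string concatenation. A hedged alternative is to let `urlencode` assemble the query string, so that any unusual characters in a key or document ID are escaped. The endpoint and the three parameter names come from the cell above; `build_download_url` is a hypothetical helper and `DEMO_KEY` in the usage line is a placeholder, not a working key.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

DOWNLOAD_URL = "https://api.data.gov/regulations/v3/download"


def build_download_url(api_key, document_id, content_type="html"):
    """Assemble the download URL with a properly escaped query string."""
    # a list of pairs (rather than a dict) keeps the parameters in this order
    query = urlencode([
        ("api_key", api_key),
        ("documentId", document_id),
        ("contentType", content_type),
    ])
    return DOWNLOAD_URL + "?" + query


url = build_download_url("DEMO_KEY", "CFTC-2011-0117-0001")
```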

In [33]:
# Testing
test = get_docs_from_api("75 FR 63732", "FR")
# We expect the three keys in the above function
print test[1].keys()
# We expect the length to match the status message printed during function execution
print len(test) 

There are 9 documents returned by "75 FR 63732"
['Full Text', 'Comment FR Reference', 'documentId']
9


In [265]:
print test[7]['documentId']
test[7]['Full Text']

CFTC-2011-0117-0001


'<html>\n<head>\n<title>Federal Register, Volume 76 Issue 170 (Thursday, September 1, 2011)</title>\n</head>\n<body><pre>\n[Federal Register Volume 76, Number 170 (Thursday, September 1, 2011)]\n[Rules and Regulations]\n[Pages 54538-54597]\nFrom the Federal Register Online via the Government Printing Office [<a href="http://www.gpo.gov">www.gpo.gov</a>]\n[FR Doc No: 2011-20817]\n\n\n\n[[Page 54537]]\n\nVol. 76\n\nThursday,\n\nNo. 170\n\nSeptember 1, 2011\n\nPart II\n\n\n\n\n\nCommodity Futures Trading Commission\n\n\n\n\n\n-----------------------------------------------------------------------\n\n\n\n\n\n17 CFR Part 49\n\n\n\n\n\nSwap Data Repositories: Registration Standards, Duties and Core \nPrinciples; Final Rule\n\n\x00\x00Federal Register / Vol. 76 , No. 170 / Thursday, September 1, 2011 / \nRules and Regulations\x00\x00\n\n[[Page 54538]]\n\n\n-----------------------------------------------------------------------\n\nCOMMODITY FUTURES TRADING COMMISSION\n\n17 CFR Part 49\n\nRIN 3

### 2.2 Processing the document's Full Text into just footnotes which contain citations

Now that we can get the full text of the rules, we need to search it for citations which contain references to comment letters.  These citations happen in footnotes, so we need a three-step process to get to our final goal: a list of all citations to comment letters for a given rule.

We shall define a function, `get_citations_to_comments`, which takes a regulations.gov `"Full Text"` and returns a list of citations.  It will: 

1. Take the full text and split it into:
 1. a list of body text blocks followed by a citation to a footnote 
 1. a list of footnotes
1. Iterate through the list of text blocks, and select only those which refer to "comment"
 1. for those which refer to comments, add the associated footnote to a list of "citations"
1. Return that list of citations as a list of strings
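The steps above can be sketched on a toy input. This is a simplified illustration, not the function defined in this notebook: it assumes footnote markers always look like `\1\`, takes the body and the footnotes as two separate strings, and pairs them by footnote number with a regular expression. `cited_footnotes` and the sample text are made up for the example.

```python
import re

# a footnote marker is a number wrapped in backslashes, e.g. \12\
MARKER = re.compile(r"\\(\d+)\\")


def cited_footnotes(body, footnotes, keyword="comment"):
    """Return footnotes where the note or the text citing it mentions keyword."""
    # Step 1b: map footnote number -> footnote text
    parts = MARKER.split(footnotes)          # ['', '1', ' text', '2', ' text']
    notes = {}
    for number, text in zip(parts[1::2], parts[2::2]):
        notes[int(number)] = text.strip()

    # Step 1a: each body chunk is the text that precedes a marker
    parts = MARKER.split(body)               # ['chunk', '1', ' chunk', '2', '']
    citations = []
    for chunk, number in zip(parts[0::2], parts[1::2]):
        note = notes.get(int(number))
        if note is None:
            continue                         # marker without a matching footnote
        # Step 2: keep the footnote if either side mentions the keyword
        if keyword in chunk.lower() or keyword in note.lower():
            citations.append(note)
    return citations                         # Step 3: a list of strings


body = "Several commenters supported the rule.\\1\\ The Commission agrees.\\2\\"
notes = "    \\1\\ See Letter from ABC Corp.\n    \\2\\ 75 FR 63732."
result = cited_footnotes(body, notes)
```

Here only the first footnote is kept, because the sentence citing it mentions "commenters".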

Note: Regulations.gov is AMAZINGLY inconsistent in how it encodes the HTML full text, with lots of little exceptions and variations in format across the .htm documents.  Thus, there are a large number of small fixes in the function to handle these cases.  Each fix follows a pattern rather than a single hard-coded case, but be careful changing anything without lots of testing to make sure the results stay consistent.  The code was originally debugged against documentId = CFTC-2013-0056-0001, which it parses correctly.  Any change must not break the parsing of that document.

In [270]:
def get_citations_to_comments(FullText):
 
    # get rid of the header information, all we want is the supplementary information section.
    #temp = FullText.split("SUPPLEMENTARY INFORMATION:")
    #FullText = temp[1]
    
    #change the ---- based delineator into something more unique to prevent problems with dashes in the actual text.
    FullText = FullText.replace("\n---------------------------------------------------------------------------\n\n",
                                "||~~Block~Separator~~||")
    
    # fix the problem with italics and footnotes causing body and footnote blocks to lack a --- separator and 
    ## instead have only a blank line separator
    FullText = FullText.replace("\n\n    \\","|~~TEMP~FOOTNOTER~~||") #first preserve the footnote pattern
    FullText = FullText.replace("\n\n"," ||~~Block~Separator~~||") #then fix the irregular block delineator problem
    FullText = FullText.replace("|~~TEMP~FOOTNOTER~~||","\n\n    \\") #now replace the footnote pattern
    
    #fix the problem with a footnote at the end of a line which does not actually end the block.
    FullText = FullText.replace("\\\n", "\\ \n")
    
    #get rid of page breaks
    FullText = FullText.replace("\n\n[["," [[")
    FullText = FullText.replace("]]\n\n","]] ")
    
    #Paragraphs are block changes too, so let's get that marked
    FullText = FullText.replace("\n\n","||~~Block~Separator~~||") ###############################################<------
     
    # Un-Wrap all the lines in the FullText to fix the silly line truncations
    streamOfText = FullText.replace("\n","")
    

    

    
    # Split into blocks, some of which are footnotes and some of which are body content.  
    ##  Use the special delineator we added earlier
    blocks = streamOfText.split("||~~Block~Separator~~||")
    '''
    print blocks[0]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    print blocks[1]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    print blocks[2]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    print blocks[3]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    '''    
    #Get rid of the empty blocks (note: filter(None, ...) drops only empty strings, so single-space blocks are kept)
    blocks = filter(None,blocks)
    
    #Separate out the body content and footnote blocks based on the pattern that footnote blocks ALWAYS start with a "\\#\\" 
    bodyBlocks = []
    footnoteBlocks = []
    
    for block in blocks:
        if block.startswith("    \\"):
            footnoteBlocks.append(block)
        else :
            bodyBlocks.append(block)
     
    
    '''    
    print "##################### Body Blocks #######################"
    print bodyBlocks[0]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    print bodyBlocks[1]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    print "##################### Footnote Blocks #######################"
    print footnoteBlocks[0]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    print footnoteBlocks[1]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    '''    
    #Split the footnote blocks into individual footnotes
    footnotes = []
    for block in footnoteBlocks:
        # split out the footnotes
        tempFootnotes = block.split("    \\")
        
        # remove the paragraph placeholder from the beginning of the first footnote in a group
        tempFootnotes[0] = tempFootnotes[0].replace("|~P~|","",1)
        
        #add the footnotes to the main list
        footnotes = footnotes + tempFootnotes[1:]
    
    #split the body content blocks at the place where each footnote appears
    bodyContent = []
    for block in bodyBlocks:
        # split out each body content based on the end of the footnote reference
        block = block.replace(";"," ") # to fix cites to footnotes which are followed by a ;
        block = block.replace(":"," ") # to fix cites to footnotes which are followed by a :
        block = block.replace(")"," ") # to fix cites to footnotes which are inside parentheses
        block = block.replace("-"," ") # to fix cites to footnotes followed by a -
        block = block.replace(",", " ") # to fix cites to footnotes which are followed by a ,
        block = block.replace("<SUP>","\\") # to fix cites to footnotes which were hacked to superscript rather than proper \\
        block = block.replace("</SUP>","\\") # to fix cites to footnotes which were hacked to superscript rather than proper \\
        if block.endswith("\\"): #to handle block which end with a citation to a footnote
            block = block+" "
        tempBodyContent = block.split("\\ ")
                
        #print bodyBlocks.index(block) ## debug
        
        #print tempBodyContent
        
        #print type(tempBodyContent)
        
        # filter out body content blocks without a footnote reference at the end
        if type(tempBodyContent) == list:
            tempBodyContent = [chunk for chunk in tempBodyContent if chunk.endswith(('0','1','2','3','4','5','6','7','8','9'))]
        elif not tempBodyContent.endswith(('0','1','2','3','4','5','6','7','8','9')):
            tempBodyContent = []
        
        #add the body content to the main list
        bodyContent = bodyContent + tempBodyContent
    
    #print "##################### Body Content ####################### Total: "+str(len(bodyContent))
    '''
    print bodyContent[0]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    print bodyContent[1]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    '''
    #print "##################### Footnotes ########################## Total: "+str(len(footnotes))
    '''
    print footnotes[0]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    print footnotes[1]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    '''
    #permanent status message intended for final use
    print "There were "+str(len(bodyContent))+" body content chunks and "+str(len(footnotes))+" footnotes found",
    
        
    
    if len(footnotes) != len(bodyContent):
        print "\n  ##################### Body Content TESTER #######################"
        for i, chunk in enumerate(bodyContent):
            # chunk i should end with footnote number i+1; report the first one that does not
            if not chunk.rstrip().endswith(str(i + 1)):
                # show up to two neighbouring chunks on each side, staying inside the list
                for j in range(max(i - 2, 0), min(i + 3, len(bodyContent))):
                    label = " ### ERROR CHUNK ###" if j == i else " ~~# Error Chunk %+d" % (j - i)
                    print label + ", index: " + str(j) + "|| " + bodyContent[j]
                    print " ~~~~~~~~~~~~~~"
                break
        else:
            # only reached when the loop finished without hitting `break`
            print "  No alignment errors found"
    
    # search through the bodyContent for the word "comment".  If found, add the corresponding footnote to the citations list
    ## note: the index of the bodyContent and footnotes lists are synchronized AND that the index is the footnote number minus 1
    ##       i.e. bodyContent 1 references footnote 1 and both are at index 0 in their corresponding lists
    ## note: by walking through both lists simultaneously, we keep the footnotes which are added to "citations" in the order
    ##       that they appear in the text.  This could be useful for some other project, if we cared about order.
    citations = []
    
    for index in range(len(footnotes)):
        # keep the footnote if it, or the body content chunk that cites it, talks about a comment
        # (the bounds check guards against a footnote that has no matching body chunk)
        if ("comment" in footnotes[index].lower() or
                (index < len(bodyContent) and "comment" in bodyContent[index].lower())):
            citations.append(footnotes[index])
    
    print "of which "+str(len(citations))+" contained citations"
    
    '''
    print "##################### Citations ##########################"
    print citations[0]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    print citations[1]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    '''
    return(citations)

In [267]:
test2 = get_citations_to_comments(test[2]['Full Text'])
#test2[1:10]

There were 146 body content chunks and 146 footnotes found of which 35 contained citations


In [268]:
## test the function on a couple documents
for document in test:
    print test.index(document),
    get_citations_to_comments(document['Full Text'])

0 There were 326 body content chunks and 326 footnotes found of which 78 contained citations
1 There were 1153 body content chunks and 1153 footnotes found of which 580 contained citations
2 There were 146 body content chunks and 146 footnotes found of which 35 contained citations
3 There were 1472 body content chunks and 1472 footnotes found of which 573 contained citations
4 There were 623 body content chunks and 622 footnotes found 
  ##################### Body Content TESTER #######################
  No alignment errors found
of which 323 contained citations
5 There were 87 body content chunks and 87 footnotes found of which 16 contained citations
6 There were 341 body content chunks and 341 footnotes found of which 113 contained citations
7 There were 331 body content chunks and 331 footnotes found of which 109 contained citations
8 There were 20 body content chunks and 20 footnotes found of which 3 contained citations


## 3 Get commenters from list of citations using Stanford Named Entity Recognizer (NER)



## 4 Create a dictionary with all commenters cited related to Dodd-Frank rules

Each of the citations will need to be stored as a dictionary entry which contains the following meta-data in addition to the 

In [6]:
all_duke = []
for i in years:
    all_duke.extend(get_api_data("Duke Ellington", i))

number of hits: 77
collecting page 0
collecting page 1
collecting page 2
collecting page 3
collecting page 4
collecting page 5
collecting page 6
collecting page 7
number of hits: 101
collecting page 0
collecting page 1
collecting page 2
collecting page 3
collecting page 4
collecting page 5
collecting page 6
collecting page 7
collecting page 8
collecting page 9
collecting page 10
number of hits: 111
collecting page 0
collecting page 1
collecting page 2
collecting page 3
collecting page 4
collecting page 5
collecting page 6
collecting page 7
collecting page 8
collecting page 9
collecting page 10
collecting page 11
number of hits: 99
collecting page 0
collecting page 1
collecting page 2
collecting page 3
collecting page 4
collecting page 5
collecting page 6
collecting page 7
collecting page 8
collecting page 9
number of hits: 114
collecting page 0
collecting page 1
collecting page 2
collecting page 3
collecting page 4
collecting page 5
collecting page 6
collecting page 7
collecting page 8

In [7]:
# test your code
len(all_duke) == 1043

True

In [8]:
all_duke[0]

{u'_id': u'4fd2872b8eb7c8105d858553',
 u'abstract': None,
 u'blog': [],
 u'byline': {u'original': u'By N.R. Kleinfield',
  u'person': [{u'firstname': u'N.',
    u'lastname': u'Kleinfield',
    u'middlename': u'R.',
    u'organization': u'',
    u'rank': 1,
    u'role': u'reported'}]},
 u'document_type': u'article',
 u'headline': {u'kicker': u'New York Bookshelf',
  u'main': u'NEW YORK BOOKSHELF/NONFICTION'},
 u'keywords': [{u'name': u'persons', u'value': u'ELLINGTON, DUKE'},
  {u'name': u'persons', u'value': u'HARRIS, DANIEL'}],
 u'lead_paragraph': u"A WIDOW'S WALK: A Memoir of 9/11 By Marian Fontana Simon & Schuster ($24, hardcover) Theresa and I walk into the Blue Ribbon, an expensive, trendy restaurant on Fifth Avenue in Park Slope. We sit at a banquette in the middle of the room and read the eclectic menu, my eyes instinctively scanning the prices for the least expensive item.",
 u'multimedia': [],
 u'news_desk': u'The City Weekly Desk',
 u'print_page': u'9',
 u'pub_date': u'2005-1

## 4. Formatting and Exporting

### 4.1 Collect more fields

In the cell below, I've pasted the code from lecture defining a function that accepts a list of unformatted documents returned by the API, and formats it into a clean list of dictionaries that contain keys for `id`, `headline`, and `date`.

Edit the function so that we include the `lead_paragraph` and `word_count` fields.

**HINT**: Some articles may not contain a lead_paragraph, in which case it'll throw an error if you try to address this value (which doesn't exist). You need to add a conditional statement that takes this into consideration.

**HINT**: Add `.encode("utf8")` at the end of dictionary key lookups. You'll thank me later when you try to export your CSV.

**Advanced**: Add another key that returns a list of `keywords` associated with the article.
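One way to handle the missing-field case from the first hint is `dict.get`, which returns `None` instead of raising `KeyError`. A minimal sketch; `lead_paragraph_or_empty` is a hypothetical helper, and falling back to an empty string is an assumption about what the CSV should contain.

```python
def lead_paragraph_or_empty(doc):
    """Return the lead paragraph, or '' when the field is absent or None."""
    # .get() covers a missing key; `or ""` also covers an explicit None value
    return doc.get("lead_paragraph") or ""
```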

In [9]:
def format_articles(unformatted_docs):
    '''
    This function takes in a list of documents returned by the NYT api 
    and parses the documents into a list of formatted dictionaries, 
    with 'id', 'headline', 'date', 'lead_paragraph', and 'word_count' keys
    '''
    formatted = []
    for i in unformatted_docs:
        dic = {}
        dic['id'] = i['_id']
        dic['headline'] = i['headline']['main'].encode("utf8")
        dic['date'] = i['pub_date'][0:10] # cutting time of day.
        # .get() returns None instead of raising KeyError when the field is missing
        if i.get('lead_paragraph'):
            dic['lead_paragraph'] = i['lead_paragraph'].encode("utf8")
        dic['word_count'] = i['word_count']
        formatted.append(dic)
    return(formatted) 

### 4.2 Format `all_duke`

Using the function you made above, format the `all_duke` data. Store the result in an object called `all_duke_formatted`

In [10]:
all_duke_formatted = format_articles(all_duke)

In [11]:
# test your code
all_duke_formatted[0]

{'date': u'2005-10-02',
 'headline': 'NEW YORK BOOKSHELF/NONFICTION',
 'id': u'4fd2872b8eb7c8105d858553',
 'lead_paragraph': "A WIDOW'S WALK: A Memoir of 9/11 By Marian Fontana Simon & Schuster ($24, hardcover) Theresa and I walk into the Blue Ribbon, an expensive, trendy restaurant on Fifth Avenue in Park Slope. We sit at a banquette in the middle of the room and read the eclectic menu, my eyes instinctively scanning the prices for the least expensive item.",
 'word_count': 629}

### 4.3 Export as CSV

Export the object all_duke_formatted into a CSV file.

In [12]:
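# list the columns explicitly so rows without a lead_paragraph still fit the header
keys = ['id', 'headline', 'date', 'lead_paragraph', 'word_count']
# write the header row, then one row per article
with open('allduke.csv', 'wb') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(all_duke_formatted)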

## 5. Extra Credit / Bonus / Advanced / Optional

Import the data in R, and produce a graph that visualizes how Duke Ellington has changed in popularity over time.

See Assignment_7_R.R