# Gathering Final Rule Text from Regulations.gov API

* * * * *

In [1]:
# Import required libraries
from __future__ import division  # __future__ imports must come before any other statement
import csv
import json
import math
import os
import re  # needed because of the CFTC's inconsistent use of <SUP> tags for footnote references
import sys
from urllib import quote_plus, urlopen

import ner
import requests

## 1: API Keys

Get an API key from the [Regulations.gov](http://regulationsgov.github.io/developers/key/) GitHub site, then assign it to the `key` variable below.

Note: The default rate limit of 1,000 requests per hour applies to all Regulations.gov API users.

In [8]:
# set key
key = "YOUR_API_KEY_HERE"  # replace with your own Regulations.gov API key; do not publish a real key
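
Given the 1,000 requests per hour limit noted above, a long run over many FR references may need to pace itself. The following is a minimal sketch and not part of the original notebook: the names `min_interval_seconds` and `Throttle` are hypothetical, and it assumes that spacing requests evenly is an acceptable way to stay under the limit. It avoids `print` so that it runs on Python 2 and Python 3 alike.

```python
import time

def min_interval_seconds(requests_per_hour=1000):
    """Smallest even spacing (in seconds) that stays within the hourly limit."""
    return 3600.0 / requests_per_hour

class Throttle(object):
    """Hypothetical helper: sleeps just long enough between calls to wait()."""

    def __init__(self, interval, clock=time.time, sleep=time.sleep):
        self.interval = interval
        self.clock = clock      # injectable so the class can be tested without real waiting
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now
```

One would create `throttle = Throttle(min_interval_seconds())` once and call `throttle.wait()` before each `requests.get`. Each FR reference triggers one search request plus one download request per document found, so every one of those calls counts toward the limit.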

## 2. Requesting Data

### 2.1 Defining the `get_docs_from_api` function

The CFTC comments we have are tied to Federal Register (FR) references to proposed rules, which use a different set of identifiers than the regulations.gov website. To get the final rules which might cite the public comments, we need to turn the CFTC's proposed-rule FR numbers into Regulations.gov `documentId` values. We will do this with the regulations.gov API by passing it a CFTC FR reference as a keyword search AND filtering the results for only CFTC documents and only final rules.

**Note**: I have written the function so that one could, in theory, ask for a different document type. In this project, only `"FR"` will ever be sent to the function. However, any of the other document types listed below could be passed instead.

**Note**: This is poorly specified in the documentation of the Regulations.gov API, but the document types which can be requested (the `dct` search parameter, exposed in our function as `documentType`) are a fixed list of values:

* N: Notice
* PR: Proposed Rule
* FR: Rule
* O: Other
* SR: Supporting & Related Material
* PS: Public Submission (NB: this is where Public Comments would live, but we do not need this for our project)

Since we want the final rule text in order to get citations, we will set up `get_docs_from_api` to take a parameter which specifies one of these document types and then pass it `"FR"`.
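
Since the list of accepted codes is fixed, a small lookup table can catch a mistyped code before a request is spent on it. This is only a sketch: the codes and labels are copied from the list above, while the names `DOCUMENT_TYPES` and `check_document_type` are hypothetical and do not appear elsewhere in this notebook.

```python
# Codes and labels copied from the list above
DOCUMENT_TYPES = {
    "N": "Notice",
    "PR": "Proposed Rule",
    "FR": "Rule",
    "O": "Other",
    "SR": "Supporting & Related Material",
    "PS": "Public Submission",
}

def check_document_type(code):
    """Hypothetical guard: return the code unchanged, or raise if it is not in the list."""
    if code not in DOCUMENT_TYPES:
        raise ValueError("Unknown document type %r; expected one of %s"
                         % (code, sorted(DOCUMENT_TYPES)))
    return code
```

A call such as `check_document_type("FR")` could be placed at the top of the function so that a typo fails immediately instead of returning an empty search.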

In [64]:
def get_docs_from_api(comment_FR_ref, documentType):
    # set base url
    base_url="https://api.data.gov/regulations/v3/documents"

    # set response format
    response_format=".json"

    # set search parameters
    search_params = {"s":comment_FR_ref,
                     "api_key":key,
                     "a":"CFTC",
                     "dct":documentType
                    }

    # make request
    r = requests.get(base_url+response_format, params=search_params)
    
    # convert to a dictionary
    data=json.loads(r.text)
    
    # get number of "hits" (doc records returned by the search for the FR ref) 
    hits = data['totalNumRecords']
    print "There are " + str(hits) + " documents returned by \""+comment_FR_ref+"\""
    
    # Make a list to hold one dictionary per rule (full text plus meta-data)
    docFullTexts = []
    
    if hits>0:
        # make an empty list where we'll hold the document IDs for each of the docs returned by the search
        docIDs = [] 

        # get just the doc records from the documents API return, not the totalNumRecords object
        docRecords = data['documents']

        # pull out the docIDs for each doc
        for docRecord in docRecords:
            docIDs.append(docRecord['documentId'].encode("utf8"))



        # now we're ready to loop through the document IDs and fetch each document's full text
        for docID in docIDs:
            # The download URL is built by hand from the pattern seen in the (singular) document API.
            # It behaves like a hidden "download" API.
            ##  Note that we request the HTML version, take its text, and encode it as a utf8 string
            fullText = requests.get("https://api.data.gov/regulations/v3/download?api_key="+key+"&documentId="+docID+"&contentType=html").text.encode("utf8")

            # create a dictionary to hold the full text and meta-data such as the document ID
            document = {}

            # store the document ID, the full text, and the FR reference that led us to this document
            document['documentId'] = docID
            document['Full Text'] = fullText
            document['Comment FR Reference'] = comment_FR_ref

            docFullTexts.append(document)


    # and we end by returning the list of documents stored as a dictionary for each document containing the full text and meta-data    
    return(docFullTexts)
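
The function above builds the download URL by string concatenation, which works as long as the key and document ID contain no characters that need escaping. As a hedged alternative, the same URL can be assembled with `urlencode`. The endpoint and parameter names below are the ones used in the function; the helper name `build_download_url` is hypothetical.

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

# Endpoint taken from the function above
DOWNLOAD_URL = "https://api.data.gov/regulations/v3/download"

def build_download_url(api_key, document_id, content_type="html"):
    """Hypothetical helper: build the same download URL with escaped parameters."""
    params = [("api_key", api_key),
              ("documentId", document_id),
              ("contentType", content_type)]
    return DOWNLOAD_URL + "?" + urlencode(params)
```

Passing a `params` dictionary to `requests.get`, as the search request in the function already does, gives the same escaping without a helper.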

In [72]:
# Testing
test = get_docs_from_api("75 FR 76139", "FR")
# We expect the three keys in the above function
print test[0].keys()
# We expect the length to match the status message printed during function execution
print len(test) 

There are 2 documents returned by "75 FR 76139"
['Full Text', 'Comment FR Reference', 'documentId']
2


In [73]:
print test[0]['documentId']
test[0]['Full Text']

CFTC-2013-0053-0001


'<html>\n<head>\n<title>Federal Register, Volume 78 Issue 105 (Friday, May 31, 2013)</title>\n</head>\n<body><pre>\n[Federal Register Volume 78, Number 105 (Friday, May 31, 2013)]\n[Rules and Regulations]\n[Pages 32865-32944]\nFrom the Federal Register Online via the Government Printing Office [<a href="http://www.gpo.gov">www.gpo.gov</a>]\n[FR Doc No: 2013-12133]\n\n\n\n[[Page 32865]]\n\nVol. 78\n\nFriday,\n\nNo. 105\n\nMay 31, 2013\n\nPart III\n\n\n\n\n\nCommodity Futures Trading Commission\n\n\n\n\n\n-----------------------------------------------------------------------\n\n\n\n\n\n17 CFR Part 43\n\n\n\n\n\nProcedures To Establish Appropriate Minimum Block Sizes for Large \nNotional Off-Facility Swaps and Block Trades; Final Rule\n\n\x00\x00Federal Register / Vol. 78, No. 105 / Friday, May 31, 2013 / Rules \nand Regulations\x00\x00\n\n[[Page 32866]]\n\n\n-----------------------------------------------------------------------\n\nCOMMODITY FUTURES TRADING COMMISSION\n\n17 CFR Part 43\

In [81]:
block = "These amending rules become effective in--and their costs and \nbenefits are considered relative to--the context of the conditions now \nin place under part 43. That is: all publicly reportable swap \ntransactions are currently subject to a time delay and are not publicly \nreported in real-time.<SUP>569 570</SUP> Unless otherwise indicated, \nthe Commission has looked to a non-financial end-user that already has \ndeveloped the technical capability and infrastructure necessary to \ncomply with the requirements set forth in part 43 as a reference entity \nfor estimating this rulemaking\'s direct costs under the assumption that \nthe costs for this particular market participant would represent the \nmaximum degree of compliance costs.\\571\\ The Commission anticipates, \nhowever, that in many cases the actual costs to established market \nparticipants (including swap counterparties, SDRs and other registered \nentities) would be lower than for the reference entity--perhaps \nsignificantly so, depending on the type, flexibility, and scalability \nof systems already in place."
block2 = "<SUP>569 570</SUP>"
superScripts = re.findall("<SUP>([0-9]+ )+[0-9]+</SUP>", block2)  ## This only fixes two citations in a superscript.  OK for now...
print superScripts

['569 ']
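
The output `['569 ']` follows from how `re.findall` treats groups: when the pattern contains one capturing group, it returns that group rather than the whole match, and a repeated group keeps only its last repetition. A small sketch of an alternative pattern, which is not the one used later in this notebook, wraps a non-capturing repetition in a single outer group so that the whole run of numbers comes back:

```python
import re

sample = "<SUP>569 570</SUP>"

# A repeated capturing group reports only its final repetition
last_repeat = re.findall(r"<SUP>([0-9]+ )+[0-9]+</SUP>", sample)

# A non-capturing repetition inside one outer group returns the whole run of numbers
whole_run = re.findall(r"<SUP>((?:[0-9]+ )+[0-9]+)</SUP>", sample)

# Each footnote number can then be recovered with split()
numbers = [run.split() for run in whole_run]
```

Here `last_repeat` is `['569 ']`, `whole_run` is `['569 570']`, and `numbers` is `[['569', '570']]`. The second pattern would also cover superscripts holding three or more footnote numbers.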


### 2.2 Processing the document's Full Text into just footnotes which contain citations

Now that we can get the full text of the rules, we need to search it for citations which contain references to comment letters. These citations happen in footnotes, so we need a three-step process to get to our final goal: a list of all citations to comment letters for a given rule.

We shall define a function, `get_citations_to_comments`, which takes a regulations.gov `"Full Text"` and returns a list of citations (footnote strings). It will:

1. Take the full text and split it into:
 1. a list of body text blocks followed by a citation to a footnote 
 1. a list of footnotes
1. Iterate through the list of text blocks, and select only those which refer to "comment"
 1. for those which refer to comments, add the associated footnote to a list of "citations"
1. Return that list of citations as a list of strings

Note: Regulations.gov is AMAZINGLY inconsistent in its encoding of HTML for the full text, with lots of little exceptions and variations in format for the .htm document. Thus, there are a large number of small fixes in the function to handle these cases. Each follows a pattern rather than a single hard-coded fix, but be careful changing anything without lots of testing to make sure that the results stay consistent. The code was originally debugged with documentId = CFTC-2013-0056-0001, which it parses correctly. Any change must not break the parsing of that document.
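
The steps above can be sketched on a toy document before reading the full function. This is a simplified illustration under stated assumptions, not the implementation used in this project: `RULE` is a shortened stand-in for the dashed line in the real documents, the toy text and the helper names are made up, and only the footnote text is checked for the word "comment".

```python
RULE = "-" * 20  # shortened stand-in for the dashed line used in the real documents

toy_text = (
    "Several commenters supported the rule.\\1\\ Others did not.\\2\\\n"
    + RULE + "\n\n"
    + "    \\1\\ See comment letter from Example Association.\n"
    + "    \\2\\ See 17 CFR part 43.\n"
    + RULE + "\n\n"
    + "The Commission agrees.\n"
)

def split_into_blocks(text, rule=RULE):
    """Split on the dashed rule and sort blocks into body text and footnotes."""
    body_blocks, footnote_blocks = [], []
    for block in text.split(rule + "\n\n"):
        if not block.strip():
            continue
        # footnote blocks begin with four spaces and a backslash
        if block.startswith("    \\"):
            footnote_blocks.append(block)
        else:
            body_blocks.append(block)
    return body_blocks, footnote_blocks

def split_footnotes(footnote_blocks):
    """Break each footnote block into individual 'N\\ text' footnotes."""
    footnotes = []
    for block in footnote_blocks:
        pieces = block.split("    \\")[1:]  # drop the empty piece before the first marker
        footnotes.extend(piece.strip() for piece in pieces)
    return footnotes

body_blocks, footnote_blocks = split_into_blocks(toy_text)
footnotes = split_footnotes(footnote_blocks)
citations = [note for note in footnotes if "comment" in note.lower()]
```

On the toy text this yields two body blocks, two footnotes, and one citation (footnote 1). The real function needs many more normalization steps because the source HTML is far less regular than this example.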

In [89]:
def get_citations_to_comments(FullText):
 
    # get rid of the header information, all we want is the supplementary information section.
    #temp = FullText.split("SUPPLEMENTARY INFORMATION:")
    #FullText = temp[1]
    
    #change the ---- based delineator into something more unique to prevent problems with dashes in the actual text.
    FullText = FullText.replace("\n---------------------------------------------------------------------------\n\n",
                                "||~~Block~Separator~~||")
    
    # fix the problem with italics and footnotes causing body and footnote blocks to lack a --- separator and 
    ## instead have only a blank line separator
    FullText = FullText.replace("\n\n    \\","|~~TEMP~FOOTNOTER~~||") #first preserve the footnote pattern
    FullText = FullText.replace("\n\n"," ||~~Block~Separator~~||") #then fix the irregular block delineator problem
    FullText = FullText.replace("|~~TEMP~FOOTNOTER~~||","\n\n    \\") #now replace the footnote pattern
    
    #fix the problem with a footnote at the end of a line which does not actually end the block.
    FullText = FullText.replace("\\\n", "\\ \n")
    
    #get rid of page breaks
    FullText = FullText.replace("\n\n[["," [[")
    FullText = FullText.replace("]]\n\n","]] ")
    
    #Paragraphs are block changes too, so let's get that marked
    FullText = FullText.replace("\n\n","||~~Block~Separator~~||")
     
    # Un-Wrap all the lines in the FullText to fix the silly line truncations
    streamOfText = FullText.replace("\n","")
    

    

    
    # Split into blocks, some of which are footnotes and some of which are body content.  
    ##  Use the special delineator we added earlier
    blocks = streamOfText.split("||~~Block~Separator~~||")
    '''
    print blocks[0]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    print blocks[1]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    print blocks[2]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    print blocks[3]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    '''    
    #Get rid of the empty blocks (filter(None, ...) drops empty strings only)
    blocks = filter(None,blocks)
    
    #Separate out the body content and footnote blocks based on the pattern that footnote blocks ALWAYS start with four spaces and a "\\#\\" 
    bodyBlocks = []
    footnoteBlocks = []
    
    for block in blocks:
        if block.startswith("    \\"):
            footnoteBlocks.append(block)
        else :
            bodyBlocks.append(block)
     
    
    '''    
    print "##################### Body Blocks #######################"
    print bodyBlocks[0]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    print bodyBlocks[1]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    print "##################### Footnote Blocks #######################"
    print footnoteBlocks[0]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    print footnoteBlocks[1]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    '''    
    #Split the footnote blocks into individual footnotes
    footnotes = []
    for block in footnoteBlocks:
        # split out the footnotes
        tempFootnotes = block.split("    \\")
        
        # remove the paragraph placeholder from the beginning of the first footnote in a group
        tempFootnotes[0] = tempFootnotes[0].replace("|~P~|","",1)
        
        #add the footnotes to the main list
        footnotes = footnotes + tempFootnotes[1:]
    
    #split the body content blocks at the place where each footnote appears
    bodyContent = []
    for block in bodyBlocks:
        # split out each body content based on the end of the footnote reference
        block = block.replace("\\;","\\ ;") # to fix cites to footnotes which are followed by a ;
        block = block.replace("\\:","\\ :") # to fix cites to footnotes which are followed by a :
        block = block.replace("\\)","\\ )") # to fix cites to footnotes which are inside parentheses
        block = block.replace("\\(","\\ (") # to fix cites to footnotes which are inside parentheses
        block = block.replace("\\-","\\ -") # to fix cites to footnotes followed by a -
        block = block.replace("\\,", "\\ ,") # to fix cites to footnotes which are followed by a ,
        block = block.replace("\\.", "\\ .") # to fix cites to footnotes which are followed by a ., which is terrible English
        ## output = re.sub(r'<(?=\d)', r'\r\n<', str)
        #block = re.sub(r'<SUP>[0-9]+ ',,block)
        #re.search(pattern, string, flags=0)
        superScripts = re.findall("<SUP>([0-9]+ )+[0-9]+</SUP>", block)  ## This only fixes two citations in a superscript.  OK for now...
        for match in superScripts:
            block = block.replace(match,match.strip()+"\\ \\")
        block = block.replace("<SUP>","\\") # to fix cites to footnotes which were hacked to superscript rather than proper \\
        block = block.replace("</SUP>","\\") # to fix cites to footnotes which were hacked to superscript rather than proper \\
        if block.endswith("\\"): #to handle blocks which end with a citation to a footnote
            block = block+" "
        tempBodyContent = block.split("\\ ")
                
        #print bodyBlocks.index(block) ## debug
        
        #print tempBodyContent
        
        #print type(tempBodyContent)
        
        # filter out body content chunks without a footnote reference at the end
        ## note: str.split always returns a list, so no type check is needed
        tempBodyContent = [chunk for chunk in tempBodyContent if chunk.endswith(('0','1','2','3','4','5','6','7','8','9'))]
        
        #add the body content to the main list
        bodyContent = bodyContent + tempBodyContent
    
    #print "##################### Body Content ####################### Total: "+str(len(bodyContent))
    '''
    print bodyContent[0]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    print bodyContent[1]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    '''
    #print "##################### Footnotes ########################## Total: "+str(len(footnotes))
    '''
    print footnotes[0]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    print footnotes[1]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    '''
    #permanent status message intended for final use
    print "    There were "+str(len(bodyContent))+" body content chunks and "+str(len(footnotes))+" footnotes found",
    
        
    
    if len(footnotes) != len(bodyContent):
        print "\n        ##################### Body Content TESTER #######################"
        errorsFound = False
        for position, chunk in enumerate(bodyContent):
            # chunk N (counting from 1) should end with footnote number N; the first mismatch marks the misalignment
            if not chunk.rstrip().endswith(str(position+1)):
                # print the offending chunk with up to two neighbors on each side, skipping any that fall outside the list
                for offset in (-2, -1, 0, 1, 2):
                    neighbor = position+offset
                    if neighbor < 0 or neighbor >= len(bodyContent):
                        continue
                    if offset == 0:
                        label = " ### ERROR CHUNK ###, index: "
                    else:
                        label = " ~~# Error Chunk %+d, index: " % offset
                    print label + str(neighbor) +"|| "+ bodyContent[neighbor]
                    print " ~~~~~~~~~~~~~~"
                errorsFound = True
                break
        if not errorsFound: 
            print "        No alignment errors found\n    ",
    
    #print (bodyContent[len(bodyContent)-5:len(bodyContent)])
    #for index in range(0,len(bodyContent)):
    #    print" ~~# index: " + str(index) +"|| "+ bodyContent[index]
    
    # search through the bodyContent for the word "comment".  If found, add the corresponding footnote to the citations list
    ## note: the index of the bodyContent and footnotes lists are synchronized AND that the index is the footnote number minus 1
    ##       i.e. bodyContent 1 references footnote 1 and both are at index 0 in their corresponding lists
    ## note: by walking through both lists simultaneously, we keep the footnotes which are added to "citations" in the order
    ##       that they appear in the text.  This could be useful for some other project, if we cared about order.
    citations = []
    
    for index in range(0,len(footnotes)):
        # test if the footnote talks about a comment and then add it to the list of citations
        if "comment" in footnotes[index].lower():
            citation = footnotes[index]
            citations.append(citation)
        # test if the body content chunk talks about a comment and then add the corresponding footnote to citations
        elif "comment" in bodyContent[index].lower():
            citation = footnotes[index]
            citations.append(citation)
    
    print "of which "+str(len(citations))+" contained citations"
    
    '''
    print "##################### Citations ##########################"
    print citations[0]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    print citations[1]
    print "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
    '''
    return(citations)
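
The function above separates body text at footnote references through a chain of `replace` calls followed by `split("\\ ")`. A possible alternative, shown here only as a sketch and not as the method this notebook uses, is a single regular expression that matches a reference such as `\12\` directly. It assumes every reference has already been normalized to that backslash form and that no other backslash-digit-backslash sequence occurs in the text; the helper name is hypothetical.

```python
import re

# A footnote reference is a backslash, one or more digits, and a backslash
FOOTNOTE_REF = re.compile(r"\\(\d+)\\")

def split_at_footnote_refs(block):
    """Return (text, footnote_number) pairs for each reference found in a body block."""
    parts = FOOTNOTE_REF.split(block)
    # With one capturing group, split() alternates: text, number, text, number, ..., trailing text
    return [(parts[i], int(parts[i + 1])) for i in range(0, len(parts) - 1, 2)]
```

Because the number is captured together with the text that precedes it, punctuation after the reference (`;`, `,`, `)` and so on) needs no special handling. Trailing text with no reference is dropped, which mirrors the filtering step in the function above.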

In [83]:
test2 = get_citations_to_comments(test[0]['Full Text'])
test2[1:10]

    There were 711 body content chunks and 711 footnotes found of which 171 contained citations


['26\\ The interested parties who either submitted comment letters or met with Commission staff included end-users, potential swap dealers, asset managers, industry groups/associations, potential SDRs, a potential SEF, multiple law firms on behalf of their clients and a DCM. Of the 105 comment letters submitted in response to the Initial Proposal, 42 letters focused on various issues relating to block trades and large notional off-facility swaps. Of the 40 meetings, five meetings focused on various issues relating to block trades and large notional off-facility swaps. All comment letters received in response to the Initial Proposal may be found on the Commission\'s Web site at: <a href="http://comments.cftc.gov/PublicComments/CommentList.aspx?id=919">http://comments.cftc.gov/PublicComments/CommentList.aspx?id=919</a>.',
 '27\\ A list of the full names and abbreviations of commenters who responded to the Initial Proposal and who the Commission refers to in the Further Block Proposal is 

In [14]:
## test the function on a couple documents
for document in test:
    print test.index(document),
    get_citations_to_comments(document['Full Text'])

0   There were 710 body content chunks and 711 footnotes found 
    ##################### Body Content TESTER #######################
 ~~# Error Chunk -2, index: 221|| .FX...............................  By numerated FX currency   Based on DCM futures                                    combinations (i.e.,        block size by swap                                    futures related) \222
 ~~~~~~~~~~~~~~
 ~~# Error Chunk -1, index: 222||     category \223
 ~~~~~~~~~~~~~~
 ### ERROR CHUNK ###, index: 223||                                    By non-enumerated FX       All trades may be                                    currency combinations      treated as block trades                                    (i.e., non-futures         \225
 ~~~~~~~~~~~~~~
 ~~# Error Chunk +1, index: 224||                                     related) \224
 ~~~~~~~~~~~~~~
 ~~# Error Chunk +2, index: 225|| Other Commodity..................  By economically-related    Based on DCM futures                          

IndexError: list index out of range
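
The run above found 710 body content chunks for 711 footnotes, and any lookup by list position (such as `bodyContent[index]` for every footnote index) can then run past the end of the shorter list. One way to make the pairing robust, sketched here with hypothetical helper names and not taken from the notebook, is to match chunks and footnotes by the footnote number they carry instead of by position. It assumes footnotes start with `N\` and that body chunks end with `\N`, as in the output shown above.

```python
import re

def number_of_footnote(footnote):
    """Footnotes look like '26\\ text'; return 26, or None if there is no leading number."""
    match = re.match(r"\s*(\d+)\\", footnote)
    return int(match.group(1)) if match else None

def number_of_chunk(chunk):
    """Body chunks end with '\\N', the footnote they cite; return N, or None."""
    match = re.search(r"\\(\d+)\s*$", chunk)
    return int(match.group(1)) if match else None

def pair_by_number(body_chunks, footnotes):
    """Match chunks to footnotes by number, so one missing chunk cannot shift the rest."""
    chunks_by_number = {}
    for chunk in body_chunks:
        number = number_of_chunk(chunk)
        if number is not None:
            chunks_by_number.setdefault(number, chunk)  # keep the first chunk citing a number
    pairs = []
    for footnote in footnotes:
        number = number_of_footnote(footnote)
        pairs.append((number, chunks_by_number.get(number), footnote))
    return pairs
```

A footnote whose body chunk was lost comes back paired with `None`, so the caller can still test the footnote text for "comment" and skip the body test for that entry.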

## 3 Get commenters from list of citations using Stanford Named Entity Recognizer (NER)

Here we will take advantage of the excellent Python interface for the NER created by Dat Hoang. It can be found here: http://github.com/dat/pyner

This will return a list of entities which are cited.


In [45]:
tagger = ner.SocketNER(host='localhost', port=8080)
testTag = tagger.get_entities("7\\ Commenters on this issue include: American Cotton Shippers Association; Agribusiness Association of Iowa; Agribusiness Association of Ohio; Agribusiness Council of Indiana; Trade Association of American Cotton Cooperatives; Commodity Markets Council; Falmouth Farm Supply; American Feed Industry Association; Grain and Feed Association of Illinois; Minnesota Grain and Feed Association; National Grain and Feed Association; Oklahoma Grain and Feed Association; Rocky Mountain Agribusiness Association; South Dakota Grain and Feed Association; Land O'Lakes; National Council of Farmer Cooperatives; American Gas Association; National Gas Supply Association; Fertilizer Institute; American Petroleum Institute; Electric Power Supply Association; National Rural Electric Cooperative Association; American Public Power Association; Large Public Power Council; Edison Electric Institute; Working Group of Commercial Energy Firms; IntercontinentalExchange Inc.; Kansas City Board of Trade; Minneapolis Grain Exchange; CME Group; Futures Industry Association; Barclays Capital; Henderson & Lyman; National Introducing Brokers Association; and National Futures Association.")
print testTag
print testTag.keys()


{u'ORGANIZATION': [u'American Cotton Shippers Association', u'Agribusiness Association of Iowa; Agribusiness Association of Ohio', u'Agribusiness Council of Indiana', u'Trade Association of American Cotton Cooperatives', u'Commodity Markets Council', u'Falmouth Farm Supply', u'American Feed Industry Association', u'Grain and Feed Association of Illinois', u'Minnesota Grain and Feed Association', u'National Grain and Feed Association', u'Oklahoma Grain and Feed Association', u'Rocky Mountain Agribusiness Association', u'South Dakota Grain and Feed Association', u'National Council of Farmer Cooperatives', u'American Gas Association', u'National Gas Supply Association', u'Fertilizer Institute', u'American Petroleum Institute; Electric Power Supply Association', u'National Rural Electric Cooperative Association', u'American Public Power Association', u'Large Public Power Council', u'Edison Electric Institute', u'Working Group of Commercial Energy Firms', u'IntercontinentalExchange Inc.', u
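
In the output above, the tagger sometimes returns two organizations as a single entity joined by a semicolon, for example `u'Agribusiness Association of Iowa; Agribusiness Association of Ohio'`. A small post-processing step can separate them. This is a sketch with a hypothetical helper name, and it assumes that a semicolon never belongs inside a commenter's name.

```python
def split_merged_entities(names):
    """Split any entity name that still contains ';' into separate, trimmed names."""
    cleaned = []
    for name in names:
        for part in name.split(";"):
            part = part.strip()
            if part:
                cleaned.append(part)
    return cleaned
```

Applied to the `ORGANIZATION` list before the entities are stored, this would turn the merged entry into two separate organizations while leaving single names untouched.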

In [67]:
# function to take a citation list and return a list of named-entity dictionaries, where each has 
##   three values: name, type (person or organization), and the footnote number where the entity appears in the original document
def get_name_entities(citations):
    tagger = ner.SocketNER(host='localhost', port=8080)
    namesAndOrgs = []
    for citation in citations:
        footnoteNumber = citation.strip().split("\\")[0] ## This gets the footnote number from the start of the citation
        tempEntList = []
        entities = tagger.get_entities(citation)

        #add the organization and person entities, with meta-data, to the temp list for this citation
        typeLabels = {u'ORGANIZATION': 'Organization', u'PERSON': 'Person'}
        for key in entities.keys():
            if key not in typeLabels:
                continue  # skip entity types other than organizations and people
            for entity in entities[key]:
                entDict = {}
                entDict['Type'] = typeLabels[key]
                entDict['Name'] = entity.encode('utf8')
                entDict['Footnote_Number'] = footnoteNumber
                tempEntList.append(entDict)

        #Add the list of both person and organization entities for this citation to the master list
        namesAndOrgs = namesAndOrgs+tempEntList
    # Return the full list of entity dictionaries which represent the entity and associated metadata.
    return namesAndOrgs

In [68]:
#testing
test3 = get_name_entities(test2)
test3[len(test3)-30:len(test3)]

[{'Footnote_Number': '41',
  'Name': 'Foreign Boards of Trade',
  'Type': 'Organization'},
 {'Footnote_Number': '64', 'Name': 'Commission', 'Type': 'Organization'},
 {'Footnote_Number': '66',
  'Name': 'Working Group of Commercial Energy Firms',
  'Type': 'Organization'},
 {'Footnote_Number': '66', 'Name': 'Commission', 'Type': 'Organization'},
 {'Footnote_Number': '66', 'Name': 'Working Group', 'Type': 'Organization'},
 {'Footnote_Number': '66', 'Name': 'MSP', 'Type': 'Organization'},
 {'Footnote_Number': '66', 'Name': 'Commission', 'Type': 'Organization'},
 {'Footnote_Number': '66', 'Name': 'Working Group', 'Type': 'Organization'},
 {'Footnote_Number': '66', 'Name': 'Sec', 'Type': 'Organization'},
 {'Footnote_Number': '66', 'Name': 'Commission', 'Type': 'Organization'},
 {'Footnote_Number': '66',
  'Name': 'Futures Commission Merchant',
  'Type': 'Organization'},
 {'Footnote_Number': '66',
  'Name': 'Futures Commission Merchants',
  'Type': 'Organization'},
 {'Footnote_Number': '69',
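
The output above repeats some entities within one footnote (for example, `Commission` appears several times under footnote 66). Whether repeats should be kept depends on the analysis, since they may be wanted when counting mentions. If one record per entity per footnote is preferred, a sketch like the following could be applied; the function name is hypothetical and the step is not part of the original pipeline.

```python
def unique_entities(entities):
    """Drop repeated (Footnote_Number, Name, Type) records, keeping first-seen order."""
    seen = set()
    unique = []
    for entity in entities:
        identity = (entity['Footnote_Number'], entity['Name'], entity['Type'])
        if identity not in seen:
            seen.add(identity)
            unique.append(entity)
    return unique
```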

## 4 Create a dictionary with all citations to commenters related to CFTC Dodd-Frank rules

Each of the cited commenters is stored as a dictionary entry with the following keys:

1. `Name` (personal or organizational name of the commenter)
1. `Type` (person or organization)
1. `Comment_FR_Reference`
1. `documentId`
1. `Footnote_Number`

### 4.1 Get the list of FR references from the CFTC commenter datafile
We need to start by pulling in a list of FR References which we will feed to the `get_docs_from_api` function.

In [30]:
FR_References = [] # create empty list to store lines
with open('CFTC_Comment_FR_References.txt') as my_file:
    for line in my_file:
        FR_References.append(line.strip()) # line.strip() removes the line-break characters and surrounding whitespace

In [31]:
#just testing
FR_References[1:10]

['75 FR 51429',
 '75 FR 59666',
 '75 FR 63732',
 '75 FR 65586',
 '75 FR 67258',
 '75 FR 63113',
 '75 FR 72816',
 '75 FR 67657',
 '75 FR 67301']
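
Every reference in the sample above follows the pattern "volume FR page". If the input file might contain blank or malformed lines, a quick format check before calling the API could save requests. The following is a sketch under that assumption about the format; the names `FR_REFERENCE` and `parse_fr_reference` are hypothetical.

```python
import re

# "75 FR 51429": volume number, the literal "FR", then the page number
FR_REFERENCE = re.compile(r"^(\d+) FR (\d+)$")

def parse_fr_reference(line):
    """Return (volume, page) as integers, or None if the line is not 'NN FR NNNNN'."""
    match = FR_REFERENCE.match(line.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Lines for which `parse_fr_reference` returns `None` could be logged and skipped instead of being sent to the search.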

### 4.2 Now we need to cycle through all of the FR References and get all the citations to commenters in all the final rules

We will use a system of nested loops which will build us a complete list of all citations to commenters by employing the functions laboriously created above.

In [84]:
##################################################################
##################################################################
##################################################################
###                                                            ###
###  NOTE: This will NOT work until You Turn on NER in bash    ###
###        look in the ner.[location].sh file, and run in a    ###
###        new git.bash window                                 ###
###                                                            ###
##################################################################
##################################################################
##################################################################

# Create a blank master-list
all_citations_to_commenters = []

# populate that list
indexer = 0
for FrRef in FR_References:
    print str(indexer)+" ",
    indexer = indexer+1
    documents = get_docs_from_api(FrRef,'FR') #recall that the 2nd argument 'FR' tells the function to pull the final rule.
    if len(documents) != 0:
        for doc in documents:
            # this gives us a list of citations (strings)
            doc_citations = get_citations_to_comments(doc['Full Text'])

            #this accepts a list of strings and returns them as dictionary entries with 'Name','Footnote_Number', and 'Type' keys
            cites_to_commenters = get_name_entities(doc_citations)
            # Add in the meta-data which the get_name_entities function does not have
            for cite in cites_to_commenters:
                cite['Comment_FR_Reference']=FrRef
                cite['documentId']=doc['documentId']
            #Add the now-enriched dictionary entries for each citation to a commenter to the master list
            all_citations_to_commenters = all_citations_to_commenters + cites_to_commenters
            

0  There are 1 documents returned by "75 FR 3281"
    There were 45 body content chunks and 45 footnotes found of which 11 contained citations
1  There are 2 documents returned by "75 FR 51429"
    There were 1472 body content chunks and 1472 footnotes found of which 573 contained citations
    There were 1663 body content chunks and 1663 footnotes found of which 733 contained citations
2  There are 2 documents returned by "75 FR 59666"
    There were 63 body content chunks and 63 footnotes found of which 16 contained citations
    There were 48 body content chunks and 48 footnotes found of which 21 contained citations
3  There are 9 documents returned by "75 FR 63732"
    There were 326 body content chunks and 326 footnotes found of which 78 contained citations
    There were 1153 body content chunks and 1153 footnotes found of which 580 contained citations
    There were 146 body content chunks and 146 footnotes found of which 35 contained citations
    There were 1472 body content c

In [85]:
# test your code
len(all_citations_to_commenters)

142761

In [87]:
all_citations_to_commenters[500]

{'Comment_FR_Reference': '75 FR 51429',
 'Footnote_Number': '447',
 'Name': 'Farm Credit Council',
 'Type': 'Organization',
 'documentId': 'CFTC-2012-0102-0001'}

## 5 Export as CSV

Export the object all_citations_to_commenters into a CSV file.

In [88]:
keys = all_citations_to_commenters[0].keys()
# write the header row, then one row per citation
with open('Data/(2015-12-07)Test_Run_of_All__Errors_but_no_exception.csv', 'wb') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(all_citations_to_commenters)
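
The export above takes its column order from `all_citations_to_commenters[0].keys()`, which depends on dictionary ordering. A hedged variant passes an explicit list of field names so the columns come out in the same order on every run. The field names are the keys shown in the sample record earlier, the function name `write_citations` is hypothetical, and the in-memory buffer is a Python 3 convenience for illustration (under Python 2 the `csv` module expects a file opened in binary mode, as in the cell above).

```python
import csv
import io

# Keys taken from the sample record shown earlier, in a fixed order
FIELDNAMES = ['Name', 'Type', 'Comment_FR_Reference', 'documentId', 'Footnote_Number']

def write_citations(rows, output_file, fieldnames=FIELDNAMES):
    """Write the header and rows with columns in a fixed, explicit order."""
    writer = csv.DictWriter(output_file, fieldnames)
    writer.writeheader()
    writer.writerows(rows)

# Example using an in-memory text buffer; a real run would pass an open file instead
buffer = io.StringIO()
write_citations([{'Comment_FR_Reference': '75 FR 51429',
                  'Footnote_Number': '447',
                  'Name': 'Farm Credit Council',
                  'Type': 'Organization',
                  'documentId': 'CFTC-2012-0102-0001'}], buffer)
csv_text = buffer.getvalue()
```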