**This notebook extracts documents and their related comments for the Bank & Financing Services category, one of the 10 categories found on regulations.gov.**

In [1]:
from pickle import dump, load
import pandas
import requests
import urllib.request
from bs4 import BeautifulSoup

In [2]:
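def read_file_get_docid(filepath):
    # Load the pickled DataFrame of documents; the with-block closes the file afterwards
    with open(filepath, 'rb') as f:
        dump_df = load(f)
    # Keep only documents that received at least one comment
    df_with_comments = dump_df[dump_df.numberOfCommentsReceived > 0]
    doc_id = df_with_comments.documentId
    doc_type = df_with_comments.documentType

    return [doc_id, set(doc_type)]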

In [3]:
[doc_id_list,types] = read_file_get_docid('data/BFS_doc_list')
print(types)
# IDs with 4 hyphen-separated parts identify documents; IDs with 3 parts identify dockets
doc_ids = [doc_id for doc_id in doc_id_list if len(doc_id.split('-')) == 4]
len(doc_ids)

{'Notice', 'Rule', 'Proposed Rule', 'Other'}


761
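
The ID convention used in the filter above can be made explicit with two small helpers. This is a minimal sketch: `classify_id` and `docket_id_of` are hypothetical names that do not appear elsewhere in this notebook, and the 3-part/4-part rule is assumed to hold only for IDs shaped like the ones in this category (for example `ASC-2016-0004-0004`).

```python
def classify_id(identifier):
    """Classify an ID by its number of hyphen-separated parts."""
    n_parts = len(identifier.split('-'))
    if n_parts == 4:
        return 'document'
    if n_parts == 3:
        return 'docket'
    return 'unknown'


def docket_id_of(document_id):
    """Return the docket ID, i.e. the document ID without its last part."""
    return document_id.rsplit('-', 1)[0]


print(classify_id('ASC-2016-0004-0004'))
print(classify_id('ASC-2016-0004'))
print(docket_id_of('ASC-2016-0004-0004'))
```

Dropping only the last part keeps the docket ID intact even if an agency prefix were to contain extra hyphens, although such IDs are filtered out by the 4-part check in this notebook.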

### Using regulations.gov API
* We use the API to retrieve the content of each document, looking it up by the document_id extracted from the file above.
* For each document_id, we also need the IDs of its comments. The code gets them by querying the document's docket and paging through the results according to the total number of comments.
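
The paging step can be sketched without touching the network by building the request URLs first. This is an illustration under assumptions, not the notebook's own code: `build_comment_page_urls` is a hypothetical helper, `YOUR_API_KEY` is a placeholder, the total of 1200 is made up, and applying the `dct=PS` filter to every page is an assumption. The endpoint and the parameter names (`countsOnly`, `rpp`, `po`, `dktid`) are the ones used in this notebook's request code.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.data.gov:80/regulations/v3/documents.json"


def build_comment_page_urls(docket_id, total, key, page_size=500):
    """Build one URL per page of results for a docket's public submissions."""
    urls = []
    for offset in range(0, total, page_size):
        params = {
            'api_key': key,
            'countsOnly': 0,
            'dct': 'PS',        # assumption: restrict every page to public submissions
            'rpp': page_size,   # results per page
            'po': offset,       # page offset
            'dktid': docket_id,
        }
        urls.append(BASE_URL + '?' + urlencode(params))
    return urls


# Hypothetical docket with 1200 comments: offsets 0, 500 and 1000
urls = build_comment_page_urls('ASC-2016-0004', total=1200, key='YOUR_API_KEY')
for u in urls:
    print(u)
```

Using `urlencode` instead of string concatenation avoids malformed query strings such as a doubled `&` and escapes any special characters in the values.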

In [11]:
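import os

# Read the API key from the environment rather than hard-coding it in the notebook.
# The variable name REGULATIONS_GOV_API_KEY is our own choice, not an API convention.
api_key = os.environ['REGULATIONS_GOV_API_KEY']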

def get_document_comments_from_api(docketId,key=api_key):
    # dct=PS restricts results to public submissions (comments) in the docket
    base_url = "http://api.data.gov:80/regulations/v3/documents.json?api_key="+key+"&dct=PS&dktid="+docketId
    # First request: ask only for the number of matching records
    response = requests.get(base_url+"&countsOnly=1")
    #print(response.status_code)
    data = response.json()
    total = data['totalNumRecords']
    com_list = []
    # Page through the results 500 records at a time (rpp = results per page, po = page offset).
    # The same dct=PS filter is applied here so that the offsets match the count above.
    for i in range(0,total,500):
        url = base_url+"&countsOnly=0&rpp=500&po="+str(i)
        response = requests.get(url)
        #print("Offset:"+str(i)+" Code:"+str(response.status_code))
        data = response.json()
        com_list += data['documents']
    com_df = pandas.DataFrame(com_list)
    return com_df

def get_document_content_from_api(docId,key=api_key):
    url = "http://api.data.gov:80/regulations/v3/document.json?api_key="+key+"&documentId="+docId
    response = requests.get(url)
    print(response.status_code)
    data = response.json()

    # Get HTML for document content
    link = data['fileFormats'][1] # The second link is the document in HTML format
    access_link = link+'&api_key='+key

    with urllib.request.urlopen(access_link) as html_response:
        html = html_response.read()

    # We are interested in the pre tags of the HTML content.
    # Name the parser explicitly so BeautifulSoup does not warn about guessing one.
    soup = BeautifulSoup(html, "lxml")
    content = soup.find_all('pre')

    # The docket ID is the first 3 parts of the 4-part document ID
    docket_id = '-'.join(docId.split('-')[:3])
    comments_df = get_document_comments_from_api(docket_id)
    # Get the comment text where it exists
    comment_text = comments_df[comments_df.commentText.notnull()].commentText
    # Get the IDs of comments that have attachments
    c_id = comments_df[comments_df.attachmentCount>0].documentId
    # .unique() below turns each pandas Series into an array.
    # TODO: return the document text, the list of comments and the list of comment IDs as a dict; do we need any other info?
    # TODO: for each dict returned, loop over the array of attachment IDs and issue an API request for each to get the file link for the attachment.
    # All these links would be PDFs, so they must be passed through the Python Imaging Library.
    doc_dict = {
        "text":content,
        "comment_list":comment_text.unique(),
        "comment_id_list": c_id.unique()
    }
    return doc_dict

In [12]:
resp = get_document_content_from_api(doc_ids[0])
resp

200






{'comment_id_list': array(['ASC-2016-0004-0004', 'ASC-2016-0004-0101', 'ASC-2016-0004-0067',
        'ASC-2016-0004-0007', 'ASC-2016-0004-0060', 'ASC-2016-0004-0068',
        'ASC-2016-0004-0107', 'ASC-2016-0004-0077', 'ASC-2016-0004-0096',
        'ASC-2016-0004-0001', 'ASC-2016-0004-0059', 'ASC-2016-0004-0054',
        'ASC-2016-0004-0009', 'ASC-2016-0004-0061', 'ASC-2016-0004-0065',
        'ASC-2016-0004-0037', 'ASC-2016-0004-0069', 'ASC-2016-0004-0045',
        'ASC-2016-0004-0072', 'ASC-2016-0004-0108', 'ASC-2016-0004-0051',
        'ASC-2016-0004-0008', 'ASC-2016-0004-0055', 'ASC-2016-0004-0062',
        'ASC-2016-0004-0094', 'ASC-2016-0004-0074', 'ASC-2016-0004-0066',
        'ASC-2016-0004-0085', 'ASC-2016-0004-0048', 'ASC-2016-0004-0006',
        'ASC-2016-0004-0058', 'ASC-2016-0004-0076', 'ASC-2016-0004-0053',
        'ASC-2016-0004-0064', 'ASC-2016-0004-0070', 'ASC-2016-0004-0005',
        'ASC-2016-0004-0002', 'ASC-2016-0004-0050', 'ASC-2016-0004-0031',
        'ASC-2016-0

In [7]:
df = get_document_comments_from_api("ASC-2016-0004")

In [9]:
ct = df[df.commentText.notnull()].commentText

ct.unique()[0]

'- I have added additional commentary to my original comment, with the full comments below:\n- How much discussion has there been regarding a flat fee option, rather than being based on the number of appraisers doing business with the AMC? Commentary and discussion have likened this ASC fee to the fee embedded in appraiser\'s state renewal fees; however, their is a significant difference: appraisers are not paying their ASC fee based on how much they work - it is a one-sized flat fee. While the impact of a per-appraiser fee could easily be absorbed by large AMCs, it could have significant negative impact on smaller, local, and regional AMCs that provide service to lenders, banks, and credit unions. Additionally, a flat one-sized fee, rather than a per-appraiser fee, would be more easily calculated, enforced, and collected, and would have less impact on AMCs.\n- Nevertheless, in the absence of an apparent consideration of a flat one-sized fee option, the option (third) for calculating t

In [10]:
di = df[df.attachmentCount>0].documentId
di.unique()

array(['ASC-2016-0004-0004', 'ASC-2016-0004-0101', 'ASC-2016-0004-0067',
       'ASC-2016-0004-0007', 'ASC-2016-0004-0060', 'ASC-2016-0004-0068',
       'ASC-2016-0004-0107', 'ASC-2016-0004-0077', 'ASC-2016-0004-0096',
       'ASC-2016-0004-0001', 'ASC-2016-0004-0059', 'ASC-2016-0004-0054',
       'ASC-2016-0004-0009', 'ASC-2016-0004-0061', 'ASC-2016-0004-0065',
       'ASC-2016-0004-0037', 'ASC-2016-0004-0069', 'ASC-2016-0004-0045',
       'ASC-2016-0004-0072', 'ASC-2016-0004-0108', 'ASC-2016-0004-0051',
       'ASC-2016-0004-0008', 'ASC-2016-0004-0055', 'ASC-2016-0004-0062',
       'ASC-2016-0004-0094', 'ASC-2016-0004-0074', 'ASC-2016-0004-0066',
       'ASC-2016-0004-0085', 'ASC-2016-0004-0048', 'ASC-2016-0004-0006',
       'ASC-2016-0004-0058', 'ASC-2016-0004-0076', 'ASC-2016-0004-0053',
       'ASC-2016-0004-0064', 'ASC-2016-0004-0070', 'ASC-2016-0004-0005',
       'ASC-2016-0004-0002', 'ASC-2016-0004-0050', 'ASC-2016-0004-0031',
       'ASC-2016-0004-0044', 'ASC-2016-0004-0071', 
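
The two filters above (comments that have text, and comment IDs that have attachments) can also be sketched on the raw records, before they are turned into a DataFrame. This is a pandas-free illustration under assumptions: `summarize_comments` is a hypothetical helper, and the sample records are made up, modeled only on the `documentId`, `attachmentCount` and `commentText` columns seen in this notebook.

```python
def summarize_comments(records):
    """Collect unique comment texts and the IDs of comments with attachments.

    records: list of dicts shaped like the entries of data['documents'].
    """
    comment_texts = []
    attachment_ids = []
    for rec in records:
        text = rec.get('commentText')
        # Keep each non-missing comment text once, in first-seen order
        if text is not None and text not in comment_texts:
            comment_texts.append(text)
        # Treat a missing or null attachmentCount as zero
        if (rec.get('attachmentCount') or 0) > 0:
            if rec['documentId'] not in attachment_ids:
                attachment_ids.append(rec['documentId'])
    return {'comment_list': comment_texts, 'comment_id_list': attachment_ids}


# Made-up sample records for illustration only
sample_records = [
    {'documentId': 'ASC-2016-0004-0004', 'attachmentCount': 1, 'commentText': None},
    {'documentId': 'ASC-2016-0004-0101', 'attachmentCount': 1, 'commentText': 'See attached.'},
    {'documentId': 'ASC-2016-0004-0017', 'attachmentCount': 0, 'commentText': 'example comment text'},
]
result = summarize_comments(sample_records)
print(result['comment_list'])
print(result['comment_id_list'])
```

Note that a comment such as "See attached." counts as comment text even though the substance is in the attachment, so the two lists overlap for such records.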

In [30]:
df

Unnamed: 0,agencyAcronym,allowLateComment,attachmentCount,commentDueDate,commentStartDate,commentText,docketId,docketTitle,docketType,documentId,documentStatus,documentType,frNumber,numberOfCommentsReceived,openForComment,organization,postedDate,rin,submitterName,title
0,ASC,False,1,2016-07-19T23:59:59-04:00,2016-05-20T00:00:00-04:00,,ASC-2016-0004,ASC NPRM regarding Proposed AMC Fees,Rulemaking,ASC-2016-0004-0004,Posted,Public Submission,,1,False,None given,2016-06-29T00:00:00-04:00,Not Assigned,Mark Larson,Mark Larson - 2016.06.28
1,ASC,False,0,2016-07-19T23:59:59-04:00,2016-05-20T00:00:00-04:00,- I have added additional commentary to my ori...,ASC-2016-0004,ASC NPRM regarding Proposed AMC Fees,Rulemaking,ASC-2016-0004-0043,Posted,Public Submission,,1,False,Property Interlink,2016-07-18T00:00:00-04:00,Not Assigned,Joshua Walitt,"Comment from Joshua Walitt, Property Interlink"
2,ASC,False,1,2016-07-19T23:59:59-04:00,2016-05-20T00:00:00-04:00,See attached.,ASC-2016-0004,ASC NPRM regarding Proposed AMC Fees,Rulemaking,ASC-2016-0004-0101,Posted,Public Submission,,1,False,RRR,2016-07-19T00:00:00-04:00,Not Assigned,Jeff Graham,"Comment from Jeff Graham, RRR"
3,ASC,False,1,,2016-07-19T00:00:00-04:00,,ASC-2016-0004,ASC NPRM regarding Proposed AMC Fees,Rulemaking,ASC-2016-0004-0067,Posted,Public Submission,,1,False,"Market Appraisals, Inc.",2016-07-19T00:00:00-04:00,Not Assigned,Margo Henson,2016.07.13 - Market Appraisal Inc - Margo Henson
4,ASC,False,0,2016-07-19T23:59:59-04:00,2016-05-20T00:00:00-04:00,It is my understanding that there is a propose...,ASC-2016-0004,ASC NPRM regarding Proposed AMC Fees,Rulemaking,ASC-2016-0004-0022,Posted,Public Submission,,1,False,Kentucky Appraiser's Board,2016-07-18T00:00:00-04:00,Not Assigned,Carol Jones,"Comment from Carol Jones, Kentucky Appraiser's..."
5,ASC,False,0,2016-07-19T23:59:59-04:00,2016-05-20T00:00:00-04:00,I do support the registration fees for the AMC...,ASC-2016-0004,ASC NPRM regarding Proposed AMC Fees,Rulemaking,ASC-2016-0004-0017,Posted,Public Submission,,1,False,SCPAC,2016-07-18T00:00:00-04:00,Not Assigned,Elaine Morgan,"Comment from Elaine Morgan, SCPAC"
6,ASC,False,0,2016-07-19T23:59:59-04:00,2016-05-20T00:00:00-04:00,I ask that the ASC prohibit AMC's from passing...,ASC-2016-0004,ASC NPRM regarding Proposed AMC Fees,Rulemaking,ASC-2016-0004-0014,Posted,Public Submission,,1,False,Kavanagh Appraisal Co,2016-07-18T00:00:00-04:00,Not Assigned,Joe Kavanagh,"Comment from Joe Kavanagh, Kavanagh Appraisal Co"
7,ASC,False,0,2016-07-19T23:59:59-04:00,2016-05-20T00:00:00-04:00,Please consider not passing along the AMC regi...,ASC-2016-0004,ASC NPRM regarding Proposed AMC Fees,Rulemaking,ASC-2016-0004-0039,Posted,Public Submission,,1,False,1958,2016-07-18T00:00:00-04:00,Not Assigned,Thomas Dilts,"Comment from Thomas Dilts, 1958"
8,ASC,False,1,2016-07-19T23:59:59-04:00,2016-05-20T00:00:00-04:00,See Attached,ASC-2016-0004,ASC NPRM regarding Proposed AMC Fees,Rulemaking,ASC-2016-0004-0007,Posted,Public Submission,,1,False,none given,2016-07-05T00:00:00-04:00,Not Assigned,William Parker,William Parker - 2016.06.30
9,ASC,False,1,,2016-07-19T00:00:00-04:00,,ASC-2016-0004,ASC NPRM regarding Proposed AMC Fees,Rulemaking,ASC-2016-0004-0060,Posted,Public Submission,,1,False,none given,2016-07-19T00:00:00-04:00,Not Assigned,John Neubauer,2016.07.11 - John Neubauer
