# Final Project - Notice and comment
Project by Jason Danker, Proxima DasMohapatra, Ankur Kumar, Emily Witt and Kinshuk
## Notebook title - Extract Data from API
**Overview**: This code builds a list of metadata for every document in all 10 categories. It then downloads selected documents and their related comments from those categories. All the data is dumped to files before other parts of the code use it.

**API documentation**: http://regulationsgov.github.io/developers/console/#!/documents.json/documents_get_0

In [1]:
# Import list + API key from regulations.gov
from pickle import dump, load
import pandas
import requests
import urllib.request
from bs4 import BeautifulSoup
import PyPDF2
api_key = 'b5Uc6UQwdVYBhNV20O11AZFc6s2cMZ8tpYrUc9tV' 

### Part 1
regulations.gov divides documents into 10 categories:
1. AD (Aerospace and Transportation)
2. AEP (Agriculture, Environment, and Public Lands)
3. BFS (Banking and Financial)
4. CT (Commerce and International)
5. LES (Defense, Law Enforcement, and Security)
6. EELS (Education, Labor, Presidential, and Government Services)
7. EUMM (Energy, Natural Resources, and Utilities)
8. HCFP (Food Safety, Health, and Pharmaceutical)
9. PRE (Housing, Development, and Real Estate)
10. ITT (Technology and Telecommunications)

In this part we use the API to get the metadata for all the documents in the 10 categories. We store this information as dataframes in 10 separate files and later use them to look up the IDs of the documents that we will pull.

In [10]:
#Input: Name of category as defined in API docs
#Output: A dataframe with the metadata of every document in the given category
def get_all_doc_id(cat):
    # First call: ask only for the record count so we know how many pages to request
    url = "http://api.data.gov:80/regulations/v3/documents.json?api_key="+api_key+"&countsOnly=1&cat="+cat+"&sb=docketId&so=ASC"
    response = requests.get(url)
    #print(response.status_code)
    data = response.json()
    total = data['totalNumRecords']
    doc_list = []
    # Page through the results 1000 records at a time (po is the offset of each page)
    for i in range(0, total, 1000):
        url = "http://api.data.gov:80/regulations/v3/documents.json?api_key="+api_key+"&countsOnly=0&rpp=1000&po="+str(i)+"&cat="+cat+"&sb=docketId&so=ASC"
        response = requests.get(url)
        #print("Offset:"+str(i)+" Code:"+str(response.status_code))
        data = response.json()
        doc_list += data['documents']
    doc_df = pandas.DataFrame(doc_list)
    return doc_df
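
The paging in `get_all_doc_id` can also be expressed by building the query string from a dictionary, which avoids hand-concatenated `&` separators. The following is a minimal sketch, not the notebook's own code: `build_page_urls` is a hypothetical helper, `"DEMO_KEY"` is a placeholder key, and the parameter names are copied from the URLs used above.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.data.gov:80/regulations/v3/documents.json"
PAGE_SIZE = 1000  # records per request, as used in the notebook


def build_page_urls(total, category, api_key, page_size=PAGE_SIZE):
    """Return one request URL per page of `page_size` records (hypothetical helper)."""
    urls = []
    for offset in range(0, total, page_size):
        params = {
            "api_key": api_key,
            "countsOnly": 0,
            "rpp": page_size,
            "po": offset,
            "cat": category,
            "sb": "docketId",
            "so": "ASC",
        }
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls


# 2500 records need three pages, starting at offsets 0, 1000 and 2000.
urls = build_page_urls(total=2500, category="BFS", api_key="DEMO_KEY")
print(len(urls))
```

Each URL could then be passed to `requests.get` exactly as in the loop above.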

In [11]:
# The following code calls the above function for one category and dumps the data to a file.
# Run this 10 times, changing the category name passed to get_all_doc_id and the file name each time.
df = get_all_doc_id("BFS")
with open('data/BFS_doc_list', 'wb') as output:
    dump(df, output, -1)
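
Instead of editing and re-running the cell ten times, the same step could be looped over the category codes listed in Part 1. This is a sketch under assumptions: `dump_all_categories` and `fake_fetch` are hypothetical names, and the fetch function is passed in so the example runs without network access.

```python
import os
import pickle
import tempfile

# Category codes from the list in Part 1.
CATEGORIES = ["AD", "AEP", "BFS", "CT", "LES", "EELS", "EUMM", "HCFP", "PRE", "ITT"]


def dump_all_categories(fetch, out_dir, categories=CATEGORIES):
    """Call fetch(cat) for every category and pickle each result to its own file."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for cat in categories:
        path = os.path.join(out_dir, cat + "_doc_list")
        with open(path, "wb") as output:
            pickle.dump(fetch(cat), output, -1)
        written.append(path)
    return written


# Stand-in for get_all_doc_id so the sketch runs offline.
def fake_fetch(cat):
    return [{"documentId": cat + "-0000-0001"}]


demo_dir = tempfile.mkdtemp()
paths = dump_all_categories(fake_fetch, demo_dir)
print(len(paths))
```

In the notebook this would be called as `dump_all_categories(get_all_doc_id, "data")`, subject to the API's request limits.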

**Part 1 Test code**

In [14]:
#Look at one of the data files
bfs = load( open( "data/BFS_doc_list", "rb" ) )
bfs.head()

Unnamed: 0,agencyAcronym,allowLateComment,attachmentCount,commentDueDate,commentStartDate,commentText,docketId,docketTitle,docketType,documentId,documentStatus,documentType,frNumber,numberOfCommentsReceived,openForComment,organization,postedDate,rin,submitterName,title
0,ASC,False,1,2016-07-19T23:59:59-04:00,2016-05-20T00:00:00-04:00,,ASC-2016-0004,ASC NPRM regarding Proposed AMC Fees,Rulemaking,ASC-2016-0004-0001,Posted,Proposed Rule,2016-11914,108,False,,2016-05-20T00:00:00-04:00,Not Assigned,,Appraisal Subcommittee: Collection and Transmi...
1,CDFI,False,0,2013-04-08T23:59:59-04:00,2013-02-05T00:00:00-05:00,,CDFI-2013-0001,Bond Guarantee Program,Rulemaking,CDFI-2013-0001-0001,Posted,Rule,2013-02055,34,False,,2013-02-05T00:00:00-05:00,1559-AA01,,Guarantees for Bonds Issued for Community or E...
2,CDFI,False,0,2016-04-08T23:59:59-04:00,2016-02-08T00:00:00-05:00,,CDFI-2016-0001,Capital Magnet Fund,Rulemaking,CDFI-2016-0001-0001,Posted,Rule,2016-02132,9,False,,2016-02-08T00:00:00-05:00,Not Assigned,,Capital Magnet Fund
3,CDFI,False,0,2015-10-30T23:59:59-04:00,2015-08-31T00:00:00-04:00,,CDFI-2016-0002,Community Development Financial Institutions P...,Rulemaking,CDFI-2016-0002-0001,Posted,Rule,2015-21227,12,False,,2015-08-31T00:00:00-04:00,Not Assigned,,Community Development Financial Institutions P...
4,CDFI,False,0,2013-12-30T23:59:59-05:00,2013-10-31T00:00:00-04:00,,CDFI_FRDOC_0001,Recently Posted CDFI Rules and Notices.,Rulemaking,CDFI_FRDOC_0001-0005,Posted,Rule,2013-25872,0,False,,2013-10-31T00:00:00-04:00,Not Assigned,,Financial Reporting Requirements for Non-Profi...


### Part 2
In this part we use the document metadata files to get the IDs of the documents we want to pull.
At first we planned to pull every document that has comments, but we later had to limit the number of documents because of the API's usage restrictions. We store every document we pull in a file that can be used in other parts of the analysis. We save data to files at each step because getting documents from the API is time-intensive (again because of the API restrictions) and we did not want to repeat it.

In [15]:
#Input: File path of the metadata file for a particular category
#Output: A dataframe of document information containing only documents that have comments, sorted by comment volume (highest first)
#Explanation: This is a helper function that gives us document ID information. We use it to hand-pick the documents we download
def read_file_get_docid(filepath):
    with open(filepath, 'rb') as inp:
        dump_df = load(inp)
    df_with_comments = dump_df[dump_df.numberOfCommentsReceived > 0]
    # DataFrame.sort was removed from pandas; sort_values is the current equivalent
    df_with_comments = df_with_comments.sort_values('numberOfCommentsReceived', ascending=False)
    return df_with_comments

**The cell below contains all the functions needed to get a document, its comments and processing PDF**

In [16]:
#Input: docID of the document that needs to be downloaded and API key
#Output: A dictionary containing the document text and a list of all comments
#Explanation: This is the first function called. Its first API call fetches the link to the document.
#             To get the actual regulation text we have to open the link using urllib and parse it.
#             To get comments we first extract the docket id from the document id. We then call get_document_comments_from_api.
#             From that function we get either comment text (for comments written directly) or a comment id (for attached comments).
#             We call get_attached_comments to get all the attached comments.
#             We then merge the regulation text and all comments (attached and direct) into a dictionary and return it.
def get_document_content_from_api(docId,key=api_key):
    url = "http://api.data.gov:80/regulations/v3/document.json?api_key="+key+"&documentId="+docId
    try:
        response = requests.get(url)
    except requests.RequestException:
        print("(api opening doc exception) error log " + url)
        raise  # without a response there is nothing to parse
    if response.status_code != 200:
        print("status code "+str(response.status_code)+" (get_document_content_from_api) the program will break at this point, which is OK because we don't need inconsistent data. Run again")
    data = response.json()
    
    # Get HTML for document content
    if(len(data['fileFormats']) == 2):    
        link = data['fileFormats'][1] # The second link is the document in HTML format
    else:
        link = data['fileFormats'][0]
    access_link = link+'&api_key='+key
    
    try:
        with urllib.request.urlopen(access_link) as doc_response:
            html = doc_response.read()
    except Exception:
        print("doc file opening exception")
        raise  # html is undefined if the download failed
    
    # We are interested in the pre tag of the HTML content
    soup = BeautifulSoup(html, "lxml")
    content = soup.find_all('pre')
    
    # Now we need to derive the docket ID from the document ID
    docket_id = '-'.join(docId.split('-')[:3])
    comment_df = get_document_comments_from_api(docket_id)
    # get comment text where exists
    comment_list =[]
    if not comment_df.empty:
        if "commentText" in comment_df:
            comment_text =comment_df[comment_df.commentText.notnull()].commentText
            comment_list =comment_text.tolist()
        # get the doc IDs of comments that have an attachment
        c_ids = comment_df[comment_df.attachmentCount>0].documentId
        # get comment for each id in list
        for each_id in c_ids.unique():
            comment_list.append(get_attached_comments(each_id))
    doc_dict = {
        "text":content,
        "comment_list":comment_list
    }
    return doc_dict

#Input: Takes docket id and api_key
#Output: A dataframe with all comment information
#Explanation: The first API call this function makes gets the comment count. We then call the API in a loop to get the comments.
#             We can get information for 1000 comments at a time. We add all of it to a dataframe and return it.
def get_document_comments_from_api(docketId,key=api_key):
    url = "http://api.data.gov:80/regulations/v3/documents.json?api_key="+key+"&countsOnly=1&dct=PS&dktid="+docketId
    try:
        response = requests.get(url)
    except:
        print("(api opening comment count)error log"+url) # prints in case we are not able to read file
    if response.status_code != 200:
        print("status code"+str(response.status_code) + " (get_document_comments_from_api) program will break at this point which is ok because we dont need inconsistent data. Run again")
    data = response.json()
    total = data['totalNumRecords']
    com_list =[]
    for i in range(0,total,1000):
        url = "http://api.data.gov:80/regulations/v3/documents.json?api_key="+key+"&countsOnly=0&rpp=1000&po="+str(i)+"&dct=PS&dktid="+docketId
        try:
            response = requests.get(url)
        except:
            print("(api opening actual comments)error log"+url) # prints in case we are not able to read file
        #print("Offset:"+str(i)+" Code:"+str(response.status_code))
        if response.status_code != 200:
            print(response.status_code)
        data = response.json()
        com_list += data['documents']
    com_df = pandas.DataFrame(com_list)
    return com_df

#Input: Attached comment id and API key
#Output: Text of attached comment
#Explanation: For the comments that are attached we need to make more calls (1 for each comment). 
#             We first get the link of the comment and then download the PDF to a file. We then convert this pdf to text and send it back
def get_attached_comments(comment_id, key=api_key):
    #print(comment_id) # for debugging
    #open the api to get file url
    url = "http://api.data.gov:80/regulations/v3/document.json?api_key="+key+"&documentId="+comment_id
    try:
        response = requests.get(url)
    except:
        print("(api opening of attached comment exception)error log" +  url)
        return ""
    if response.status_code != 200:
        print("status code " +str(response.status_code)+" (get_attached_comments) program will break at this point which is ok because we dont need inconsistent data. Run again ")
    data = response.json()
    comment_text = ""
    # Only the first attachment is read; tolerate comments whose first attachment lists no files
    attachments = data.get("attachments", [])
    file_formats = attachments[0].get("fileFormats", []) if attachments else []
    for link in file_formats:
        if link.endswith("pdf"):
            access_link = link+'&api_key='+key
            # download the PDF to document.pdf, then read it page by page
            download_file(access_link)
            try:
                with open('document.pdf','rb') as pdfFileObj:     #'rb' for read binary mode
                    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
                    for page_no in range(pdfReader.numPages):
                        pageObj = pdfReader.getPage(page_no)
                        comment_text += pageObj.extractText()
            except Exception:
                print("(pdf exception)cant read "+comment_id ) # prints in case we are not able to read the file
            break # only the first PDF found is processed
    return comment_text

#Input: Link of PDF to be downloaded
#Output: None. 
#Explanation: It just writes PDF content to a PDF file
def download_file(download_url):
    try:
        response = urllib.request.urlopen(download_url)
        file = open("document.pdf", 'wb')
        file.write(response.read())
        file.close()
    except:
        print("(downloading the pdf exception)error log" +  download_url)
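
`get_document_content_from_api` derives the docket ID by keeping the first three hyphen-separated parts of the document ID. That works for IDs such as `SBA-2010-0001-0001`, but for an ID such as `CDFI_FRDOC_0001-0005` (seen in the Part 1 output) it returns the document ID unchanged. The sketch below shows an alternative that drops only the last segment. It assumes a document ID is always the docket ID plus one trailing segment, which holds for the rows shown in Part 1 but is not confirmed for every agency; `docket_id_from_document_id` is a hypothetical helper.

```python
def docket_id_from_document_id(document_id):
    """Drop the trailing '-NNNN' segment of a document ID (assumed format)."""
    docket_id, separator, _sequence = document_id.rpartition("-")
    # If there is no hyphen at all, return the input unchanged.
    return docket_id if separator else document_id


def first_three_parts(document_id):
    """The approach used in the notebook, kept here for comparison."""
    return "-".join(document_id.split("-")[:3])


for doc_id in ["SBA-2010-0001-0001", "CDFI_FRDOC_0001-0005"]:
    print(doc_id, first_three_parts(doc_id), docket_id_from_document_id(doc_id))
```

Both functions agree on the first ID; only the second ID shows the difference.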
    


#### Wrapping all of the above to download data in a file

In [17]:
# WARNING: Run only once to initialize the file (running it again erases the downloaded data)
op = open("data/Master_doc_content", 'wb')
dump([], op, -1)
op.close()

In [18]:
#Input: name of file (added so that we can have multiple files when one gets too big), docID to download
#Output: None
#Explanation: Calls get_document_content_from_api to download the one doc that is passed.
#            Then it opens the file (which already holds a list) and appends the result to the list.
#            Data format of the resulting file is a list of dicts
def get_one_doc(name,docid):
    #open file - get present data (empty to begin with)
    filepath = "data/"+name+"_doc_content"
    inp =open(filepath,'rb')
    doc_collection = load(inp)
    inp.close()
    #get that one doc
    resp = get_document_content_from_api(docid)
    doc_collection.append(resp)
    #put back in file 
    output = open(filepath, 'wb')
    dump(doc_collection, output, -1)
    output.close()
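
Because `get_one_doc` rewrites the whole file on every call, an interruption during the write could leave the collected data unreadable. One way to reduce that risk is to write to a temporary file and swap it into place afterwards. This is a minimal sketch of that idea, not part of the original notebook; `append_to_pickle` and the demo file name are hypothetical.

```python
import os
import pickle
import tempfile


def append_to_pickle(filepath, item):
    """Append item to the pickled list at filepath, replacing the file in one step."""
    with open(filepath, "rb") as inp:
        collection = pickle.load(inp)
    collection.append(item)
    # Write to a temporary file in the same directory, then swap it in.
    directory = os.path.dirname(os.path.abspath(filepath))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as output:
        pickle.dump(collection, output, -1)
    os.replace(tmp_path, filepath)
    return len(collection)


# Demo on a throwaway file initialized with an empty list, as in the cell above.
demo_path = os.path.join(tempfile.mkdtemp(), "Demo_doc_content")
with open(demo_path, "wb") as op:
    pickle.dump([], op, -1)

append_to_pickle(demo_path, {"text": "rule text", "comment_list": []})
count = append_to_pickle(demo_path, {"text": "another rule", "comment_list": ["a comment"]})
print(count)
```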


In [19]:
# Run this line once or twice every hour.
# We made a list of documents and downloaded them one by one so that we could get a few big docs per hour instead of many small ones.
get_one_doc("Master",'SBA-2010-0001-0001')
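
The manual "once or twice every hour" routine could be automated with a loop that pauses between batches. The sketch below is an illustration only: `download_in_batches` is a hypothetical helper, and the batch size of 2 and the one-hour pause are assumptions taken from the comment above rather than from a documented API limit. The download and sleep functions are passed in so the example runs instantly and offline.

```python
import time


def download_in_batches(doc_ids, download, batch_size=2, pause_seconds=3600, sleep=time.sleep):
    """Call download(doc_id) for each ID, pausing after every batch_size calls."""
    done = []
    for position, doc_id in enumerate(doc_ids, start=1):
        download(doc_id)
        done.append(doc_id)
        more_to_do = position < len(doc_ids)
        if more_to_do and position % batch_size == 0:
            sleep(pause_seconds)
    return done


# Stand-ins: record the calls instead of hitting the API or actually sleeping.
downloaded = []
pauses = []
ids = ["SBA-2010-0001-0001", "IRS-2010-0010-0001", "CFPB-2015-0029-0001"]
done = download_in_batches(ids, downloaded.append, sleep=pauses.append)
print(done == ids, pauses)
```

In the notebook the download argument could be `lambda doc_id: get_one_doc("Master", doc_id)`.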



# Testing section
This section checks that the document download works and looks up document IDs (to build the aforementioned list of docIDs to download).

**To check how document download is working**

In [20]:
# To check how many docs have been downloaded
dc =load(open("data/Master_doc_content",'rb'))
len(dc)

1

In [21]:
#to check content
dc[0]["text"]

[<pre>
 [Federal Register: March 4, 2010 (Volume 75, Number 42)]
 [Proposed Rules]               
 [Page 10029-10058]
 From the Federal Register Online via GPO Access [wais.access.gpo.gov]
 [DOCID:fr04mr10-29]                         
 
 
 [[Page 10029]]
 
 -----------------------------------------------------------------------
 
 Part II
 
 
 
 
 
 Small Business Administration
 
 
 
 
 
 -----------------------------------------------------------------------
 
 
 
 13 CFR Parts 121, 127, and 134
 
 
 
 Women-Owned Small Business Federal Contract Program; Proposed Rule
 
 
 [[Page 10030]]
 
 
 -----------------------------------------------------------------------
 
 SMALL BUSINESS ADMINISTRATION
 
 13 CFR Parts 121, 127, and 134
 
 RIN 3245-AG06
 
  
 Women-Owned Small Business Federal Contract Program
 
 AGENCY: Small Business Administration.
 
 ACTION: Notice of proposed rulemaking; withdrawal of proposed rule.
 
 --------------------------------------------------------------------

In [22]:
# open metadata file
df_test =read_file_get_docid('data/BFS_doc_list')
df_test.reset_index(inplace = True)



In [23]:
# get some comment counts and index
df_test["numberOfCommentsReceived"][10:65] #60,61,62,63

10    9369
11    8561
12    6465
13    4627
14    4293
15    3153
16    2492
17    2329
18    1851
19    1768
20    1689
21    1626
22    1473
23    1440
24    1425
25    1165
26     998
27     979
28     972
29     854
30     843
31     839
32     829
33     741
34     671
35     622
36     592
37     555
38     539
39     538
40     425
41     408
42     406
43     398
44     367
45     320
46     318
47     298
48     291
49     283
50     252
51     231
52     230
53     224
54     224
55     218
56     216
57     212
58     202
59     201
60     193
61     182
62     181
63     166
64     165
Name: numberOfCommentsReceived, dtype: int64

In [25]:
# select documents based on above  (go for higher comment docs)
df_test["documentId"][23:30]

23     IRS-2012-0051-0001
24     OCC-2012-0010-0001
25    CFPB-2013-0002-0001
26     SBA-2010-0001-0001
27     IRS-2010-0017-0038
28     IRS-2010-0010-0001
29    CFPB-2015-0029-0001
Name: documentId, dtype: object