## Capstone 1: Oregon State Legislature Bill Clustering

### Data Wrangling Part 1: Getting PDF documents from a URL:


In [1]:
import pandas as pd
import numpy as np
import requests
import urllib.request

### Obtain the URL data from the API and store in a dataframe:

In [2]:
from sodapy import Socrata

In [3]:
# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.oregon.gov", None)

# Example authenticated client (needed for non-public datasets):
# client = Socrata("data.oregon.gov",
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")

# First 2000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("murb-ru5f", limit=2000)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)
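
The call above asks for at most 2000 rows in a single request. If a dataset held more rows than that, one option would be to page through it with `limit` and `offset`. The sketch below is only an illustration: `fetch_all_rows` is a hypothetical helper (not part of sodapy), and it accepts any callable taking `(limit, offset)` so it can be tried without network access.

```python
from typing import Callable, Dict, List


def fetch_all_rows(get_page: Callable[[int, int], List[Dict]],
                   page_size: int = 1000) -> List[Dict]:
    """Collect rows page by page until a short (or empty) page comes back."""
    rows: List[Dict] = []
    offset = 0
    while True:
        page = get_page(page_size, offset)
        rows.extend(page)
        # A page shorter than page_size means there is nothing left to fetch.
        if len(page) < page_size:
            break
        offset += page_size
    return rows
```

With the client from the cell above, an assumed usage would be `fetch_all_rows(lambda limit, offset: client.get("murb-ru5f", limit=limit, offset=offset))`.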



### Extract usable portions of text column for file naming

The first step is to isolate the relevant text from the "measure_number" column: the abbreviated name of the bill and the URL of the bill.

The second step is to add columns for the governor and the year.

We want to build the file name as:
file_name = gov_name + "_" + gov_year + "_" + the_bill

Example:
'Brown_2018_HB405'
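
As a sketch of this naming scheme, the bill's short name can also be read from the URL path with `urllib.parse` instead of counting `/` characters in a string. The helper names `bill_from_url` and `build_file_name` are hypothetical, and the URL is a placeholder whose only purpose is to have the bill name in the same path position that the cell below relies on.

```python
from urllib.parse import urlparse


def bill_from_url(url: str, position: int = 4) -> str:
    """Return the path segment at `position` (0-based) of the URL path."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[position]


def build_file_name(gov_name: str, gov_year, the_bill: str) -> str:
    """Join the three parts with underscores."""
    return "_".join([gov_name, str(gov_year), the_bill])


# Placeholder URL (not a real bill link); only its shape matters here.
example_url = "https://example.org/a/b/c/d/HB405/Enrolled"
print(build_file_name("Brown", 2018, bill_from_url(example_url)))
```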


In [4]:
#get the abbreviated name of the bill

#create an empty list to hold the results
simple_name = []
#loop through the measure_number column
for a in results_df['measure_number']:

    #convert the entry to a string
    x = str(a)

    #grab everything after the "//"
    after_https = x.split('//')[1]

    #take the sixth "/"-separated segment, which holds the bill name
    the_bill = after_https.split('/')[5]

    #split on the single quote and take the first item
    simple_name_a = the_bill.split("'", 1)[0]

    #add the result to the list
    simple_name.append(simple_name_a)

#place the results into a new column in the df
results_df['simple_name'] = simple_name


In [5]:
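#follow the same steps as above to extract the bill URL

final = []
for url in results_df['measure_number']:
    #convert the entry to a string
    x = str(url)
    #keep everything after the first ": '" (where the URL starts)
    y = x.split(": '", 1)[1]
    #drop anything after the first comma
    z = y.split(",", 1)[0]
    #drop the closing single quote and anything after it
    final_url = z.split("'", 1)[0]
    final.append(final_url)
results_df['bill_url'] = final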
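
The string splitting above works on the text form of each `measure_number` entry. The displayed output suggests each entry is a dict with a `'url'` key, so reading that key directly may be less fragile. The sketch below assumes that structure; `extract_url` is a hypothetical helper and the sample records are made up.

```python
import ast
from typing import Any, Optional


def extract_url(entry: Any) -> Optional[str]:
    """Return entry['url'] for a dict, or for a string that parses to a dict."""
    if isinstance(entry, str):
        try:
            entry = ast.literal_eval(entry)
        except (ValueError, SyntaxError):
            return None
    if isinstance(entry, dict):
        return entry.get("url")
    # Anything else (None, NaN, numbers) has no URL to offer.
    return None


# Made-up records: a dict, the same shape as text, and a missing value.
samples = [
    {"url": "https://example.org/a/b/c/d/SB1/Enrolled"},
    "{'url': 'https://example.org/a/b/c/d/HB2/Enrolled'}",
    None,
]
print([extract_url(s) for s in samples])
```

Applied to the DataFrame, this could look like `results_df['measure_number'].apply(extract_url)`.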


In [6]:
#add a column for the governor name (the same value for every row)
results_df["gov_name"] = "Kitzhaber"

#add a column for the year the bill was signed
#(kept as a string so it can be joined into the file name)
results_df["gov_year"] = "2014"


In [7]:
#compose the file name 
results_df['file_name'] = results_df['gov_name'] + "_" + results_df['gov_year'] + "_" + results_df['simple_name']

In [8]:
results_df.head()

| | measure_number | signed_or_vetoed | date | relating_to_clause | links | simple_name | bill_url | gov_name | gov_year | file_name |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'url': 'https://olis.leg.state.or.us/liz/2014... | <p>Signed</p> | 2014-02-26T00:00:00.000 | <p>Relating to the Oregon Ocean Science Trust;... | | SB1545 | https://olis.leg.state.or.us/liz/2014R1/Downlo... | Kitzhaber | 2014 | Kitzhaber_2014_SB1545 |
| 1 | {'url': 'https://olis.leg.state.or.us/liz/2014... | <p>Signed</p> | 2014-02-26T00:00:00.000 | <p>Relating to vessel ocean Dungeness crab per... | | HB4049 | https://olis.leg.state.or.us/liz/2014R1/Downlo... | Kitzhaber | 2014 | Kitzhaber_2014_HB4049 |
| 2 | {'url': 'https://olis.leg.state.or.us/liz/2014... | <p>Signed</p> | 2014-02-26T00:00:00.000 | <p>Relating to entrepreneurial development;</p> | | SB1563 | https://olis.leg.state.or.us/liz/2014R1/Downlo... | Kitzhaber | 2014 | Kitzhaber_2014_SB1563 |
| 3 | {'url': 'https://olis.leg.state.or.us/liz/2014... | <p>Signed</p> | 2014-02-26T00:00:00.000 | <p>Relating to applications for exotic animal ... | | SB1584 | https://olis.leg.state.or.us/liz/2014R1/Downlo... | Kitzhaber | 2014 | Kitzhaber_2014_SB1584 |
| 4 | {'url': 'https://olis.leg.state.or.us/liz/2014... | <p>Signed</p> | 2014-03-03T00:00:00.000 | <p>Relating to continuity in the enrollment of... | | HB4007 | https://olis.leg.state.or.us/liz/2014R1/Downlo... | Kitzhaber | 2014 | Kitzhaber_2014_HB4007 |


## Data Wrangling Part 2: Extracting text from PDF documents

### Writing a single pdf file locally

In [9]:
#take the URL of the first bill
first_url = results_df['bill_url'][0]

In [10]:
urllib.request.urlretrieve(first_url, r'C:\Users\ASUS\single.pdf')

('C:\\Users\\ASUS\\single.pdf', <http.client.HTTPMessage at 0x15c16921bc8>)

### Writing a PDF file for each bill 

In [11]:
#use a lambda function that goes through each row,
#takes the entry in the bill_url column,
#and writes a pdf using file_name as the name of the pdf.
results_df.apply(
    lambda row: urllib.request.urlretrieve(
        row['bill_url'], r'C:\Users\ASUS\PDFsCapstone\{}.pdf'.format(row['file_name'])),
    axis=1)

0      (C:\Users\ASUS\PDFsCapstone\Kitzhaber_2014_SB1...
1      (C:\Users\ASUS\PDFsCapstone\Kitzhaber_2014_HB4...
2      (C:\Users\ASUS\PDFsCapstone\Kitzhaber_2014_SB1...
3      (C:\Users\ASUS\PDFsCapstone\Kitzhaber_2014_SB1...
4      (C:\Users\ASUS\PDFsCapstone\Kitzhaber_2014_HB4...
                             ...                        
116    (C:\Users\ASUS\PDFsCapstone\Kitzhaber_2014_SB5...
117    (C:\Users\ASUS\PDFsCapstone\Kitzhaber_2014_SB5...
118    (C:\Users\ASUS\PDFsCapstone\Kitzhaber_2014_SB5...
119    (C:\Users\ASUS\PDFsCapstone\Kitzhaber_2014_HB4...
120    (C:\Users\ASUS\PDFsCapstone\Kitzhaber_2014_HB4...
Length: 121, dtype: object
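
The `apply` call above is used only for its side effect of writing files, and a single failed request would stop the whole run. A plain loop that skips files already on disk and records failures is one alternative. This is a sketch under stated assumptions: `download_all` and its `retrieve` parameter are hypothetical names, and `retrieve` defaults to `urllib.request.urlretrieve` so the download step matches the cell above.

```python
import urllib.request
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def download_all(items: Iterable[Tuple[str, str]], out_dir: str,
                 retrieve: Callable = urllib.request.urlretrieve
                 ) -> List[Tuple[str, str]]:
    """Download each (url, file_name) pair to out_dir/<file_name>.pdf.

    Returns a list of (file_name, error message) for downloads that failed.
    """
    target_dir = Path(out_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    failures = []
    for url, file_name in items:
        target = target_dir / f"{file_name}.pdf"
        if target.exists():
            continue  # already downloaded on an earlier run
        try:
            retrieve(url, str(target))
        except (OSError, ValueError) as exc:
            # urllib's URLError and HTTPError subclass OSError;
            # ValueError covers strings that are not valid URLs.
            failures.append((file_name, str(exc)))
    return failures
```

An assumed usage with the DataFrame from this notebook would be `download_all(zip(results_df['bill_url'], results_df['file_name']), r'C:\Users\ASUS\PDFsCapstone')`.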

### Extracting text from all pages of a PDF

In [12]:
import PyPDF2

#We first open the file and create an object.
pdfFileObject = open(r"C:\Users\ASUS\PDFsCapstone\Brown_2019_SB 98.pdf", 'rb')

#Next we read the object
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
pdfReader.numPages

4
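
The PyPDF2 calls in this section (`PdfFileReader`, `numPages`, `getPage`, `extractText`) are the older names; PyPDF2 3.0 removed them in favor of `PdfReader`, `reader.pages` and `page.extract_text()`. As a sketch for the newer interface, the hypothetical helper `extract_all_text` below works with any reader-like object that exposes `pages` whose items have `extract_text()`. The stand-in classes are made up so the block runs without a PDF on disk.

```python
def extract_all_text(reader, separator="\n"):
    """Concatenate the text of every page of a PdfReader-like object."""
    # Treat a page with no extractable text (None) as an empty string.
    return separator.join(page.extract_text() or "" for page in reader.pages)


# Stand-in objects (made up) that mimic the shape of a reader and its pages.
class FakePage:
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


class FakeReader:
    def __init__(self, texts):
        self.pages = [FakePage(t) for t in texts]


print(extract_all_text(FakeReader(["page one", "page two"])))
```

With PyPDF2 3.x installed, an assumed usage would be `extract_all_text(PyPDF2.PdfReader(path_to_pdf))`.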

In [13]:
#Now we need to extract the text from all pages
#so we write a loop to accomplish this:

#loop through each page in number of pages
for pageNum in range(0, pdfReader.numPages):
    #get each page in document
    pageObj = pdfReader.getPage(pageNum)
    #print each page in document
    print(pageObj.extractText())

80th OREGON LEGISLATIVE ASSEMBLY--2019 Regular SessionEnrolledSenate Bill 98Printed pursuant to Senate Interim Rule 213.28 by order of the President of the Senate in conform-ance with presession filing rules, indicating neither advocacy nor opposition on the part of thePresident (at the request of Senate Interim Committee on Environment and Natural Resources)CHAPTER.................................................AN ACTRelating to renewable natural gas; and prescribing an effective date.
Be It Enacted by the People of the State of Oregon:SECTION 1.Sections 2 to 6 of this 2019 Act are added to and made a part of ORS chapter757.SECTION 2.(1) The Legislative Assembly finds and declares that:(a) Renewable natural gas provides benefits to natural gas utility customers and to thepublic; and(b) The development of renewable natural gas resources should be encouraged to supporta smooth transition to a low carbon energy economy in Oregon.(2) The Legislative Assembly therefore declares that:
(a) 