### Extracting Bill Information from Webpage

This notebook scrapes bill information directly from the Florida Senate website. The site uses a consistent URL format in which the path identifies the session year and the bill number (for example, `/Session/Bill/2024/115`). Given the URL for a bill, the code below extracts the relevant information.
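As a minimal sketch of the URL pattern described above, the helper below builds a bill page URL from a session year and a bill number. The function name `build_bill_url` and the `tab` parameter are hypothetical, and the pattern is inferred only from the single example URL used in this notebook, so other sessions or bill types may not follow it.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.flsenate.gov"


def build_bill_url(session_year: int, bill_number: int, tab: str = "BillText") -> str:
    """Build a bill page URL (hypothetical helper; pattern assumed from one example URL)."""
    path = f"/Session/Bill/{session_year}/{bill_number}/ByCategory/?Tab={tab}"
    return urljoin(BASE_URL, path)


# Reproduces the URL used in the cell below
print(build_bill_url(2024, 115))
```

Keeping the year and bill number as parameters makes it easier to loop over several bills later without editing the URL string by hand.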

In [3]:
import requests
from bs4 import BeautifulSoup

# URL of the Florida Senate Bill page
url = 'https://www.flsenate.gov/Session/Bill/2024/115/ByCategory/?Tab=BillText'

# Send a GET request to the Florida Senate Bill page (the timeout keeps the call from hanging indefinitely)
response = requests.get(url, timeout=30)

# Check if the request was successful
if response.status_code == 200:
    print("Successfully retrieved the webpage.")
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.content, 'html.parser')
    
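    # Extract the description of the bill from the <p> element with class 'width80',
    # guarding against the element being absent so the cell does not raise AttributeError
    description_tag = soup.find('p', class_='width80')
    bill_description = description_tag.get_text(strip=True) if description_tag else 'Not found'
    print(f'Bill Description: {bill_description}')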
    
    # Find the <a> elements that link to the bill text.
    # 'a.specific-class' is a placeholder that matches nothing on this page, so it must be
    # replaced with the real class or ID (the next section uses the class 'lnk_BillTextPDF').
    bill_text_links = soup.select('a.specific-class')
    
    # Check if any links were found
    if bill_text_links:
        print(f"Found {len(bill_text_links)} bill text links.")
        for link in bill_text_links:
            # Extract the URL to the bill text
            bill_url = link['href']
            print(f'Bill Text URL: {bill_url}')
    else:
        print("No bill text links were found. Check the selector.")

else:
    print(f'Failed to retrieve the webpage, status code: {response.status_code}')


Successfully retrieved the webpage.
Bill Description: Progressive Supranuclear Palsy and Other Neurodegenerative Diseases Policy Workgroup;Requires State Surgeon General to establish progressive supranuclear palsy & other neurodegenerative diseases policy workgroup; provides for duties, membership, & meetings of workgroup; requiring State Surgeon General to submit annual reports & final report by specified date to Governor & Legislature.
No bill text links were found. Check the selector.
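In the output above, the bill title and its summary come back joined by a semicolon, because `get_text(strip=True)` concatenates the text of nested elements. A minimal sketch for separating them is shown below. It assumes the title never contains a semicolon, which holds for this bill but is not guaranteed in general, and the function name `split_title_and_summary` is hypothetical.

```python
from typing import Tuple


def split_title_and_summary(description: str) -> Tuple[str, str]:
    """Split 'Title;Summary' at the first semicolon (hypothetical helper).

    If there is no semicolon, the title is returned empty and the whole
    string is treated as the summary.
    """
    title, separator, summary = description.partition(";")
    if not separator:
        return "", description.strip()
    return title.strip(), summary.strip()


# Prefix of the description printed above
sample = (
    "Progressive Supranuclear Palsy and Other Neurodegenerative Diseases Policy Workgroup;"
    "Requires State Surgeon General to establish progressive supranuclear palsy & other "
    "neurodegenerative diseases policy workgroup; provides for duties, membership, & meetings of workgroup"
)
title, summary = split_title_and_summary(sample)
print(f"Title: {title}")
print(f"Summary: {summary}")
```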


### Downloading the Bill Information for Analysis

The code below locates the link to the bill's PDF on the bill page and downloads the file so that its text can be extracted for summarization.

In [2]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# URL of the Florida Senate Bill page
base_url = 'https://www.flsenate.gov'
bill_page_url = '/Session/Bill/2024/115/ByCategory/?Tab=BillText'

# Send a GET request to the Florida Senate Bill page
response = requests.get(urljoin(base_url, bill_page_url))

# Check if the request was successful
if response.status_code == 200:
    print("Successfully retrieved the webpage.")
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Extract the bill PDF link from the <a> element with class 'lnk_BillTextPDF'
    bill_pdf_link = soup.find('a', class_='lnk_BillTextPDF')
    
    if bill_pdf_link:
        # Construct the full URL to the bill text PDF
        bill_pdf_url = urljoin(base_url, bill_pdf_link['href'])
        print(f'Bill PDF URL: {bill_pdf_url}')
        
        # Download the bill PDF (the timeout keeps the call from hanging indefinitely)
        pdf_response = requests.get(bill_pdf_url, timeout=30)
        
        if pdf_response.status_code == 200:
            # Define the local path where you want to save the PDF
            local_filename = "bill_text.pdf"
            with open(local_filename, 'wb') as pdf_file:
                pdf_file.write(pdf_response.content)
            print(f'Successfully downloaded the bill text PDF to {local_filename}.')
        else:
            print(f'Failed to download the PDF, status code: {pdf_response.status_code}')
    else:
        print("The link to the bill text PDF was not found.")
else:
    print(f'Failed to retrieve the webpage, status code: {response.status_code}')


Successfully retrieved the webpage.
Bill PDF URL: https://www.flsenate.gov/Session/Bill/2024/115/BillText/c1/PDF
Successfully downloaded the bill text PDF to bill_text.pdf.
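A 200 status code only says that the server answered; it does not guarantee that the payload is a PDF (a server can return an HTML error page with status 200). As a hedged sketch, the check below looks for the `%PDF-` signature that PDF files start with. The helper name `looks_like_pdf` is hypothetical, and this is a quick sanity check, not full validation of the file.

```python
PDF_SIGNATURE = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload begins with the PDF header signature."""
    return data.startswith(PDF_SIGNATURE)


# Illustrative payloads: a PDF-style header and an HTML page
print(looks_like_pdf(b"%PDF-1.7\n"))
print(looks_like_pdf(b"<!DOCTYPE html><html></html>"))
```

In the cell above, this could be called as `looks_like_pdf(pdf_response.content)` before the file is written to disk.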


In [None]:
# Important: pin the OpenAI Python library to version 0.28.
# The code below uses openai.ChatCompletion, which was removed in version 1.0 and later.

%pip install openai==0.28

In [4]:
import os

import fitz  # PyMuPDF
import openai

# Read the OpenAI API key from the OPENAI_API_KEY environment variable
# so that it is never hard-coded in the notebook
openai.api_key = os.environ.get("OPENAI_API_KEY")

# Function to summarize text using the OpenAI Chat Completions API
def summarize_with_openai_chat(text, model="gpt-3.5-turbo"):
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are going to generate a 1-3 sentence response summarizing each page of a bill passed in the Florida Senate. You will receive the raw text of each page."},
            {"role": "user", "content": text}
        ]
    )
    
    content = response['choices'][0]['message']['content']
    return content

# Path to the PDF file
pdf_path = "bill_text.pdf"

# Open the PDF file and summarize it one page at a time
with fitz.open(pdf_path) as pdf:
    for page_num, page in enumerate(pdf, start=1):
        # Extract the raw text from the page
        text = page.get_text()
        
        # Get a summary of the page using the OpenAI Chat Completions API
        summary = summarize_with_openai_chat(text)
        print(f"Summary of page {page_num}:")
        print(summary)


Summary of page 1:
This bill, titled the "Justo R. Cortes Progressive Supranuclear Palsy Act," creates a policy workgroup for progressive supranuclear palsy and other neurodegenerative diseases. The State Surgeon General will be responsible for establishing the workgroup, which will have specific duties, membership, and meetings. The workgroup will submit annual reports and a final report to the Governor and the Legislature.
Summary of page 2:
Page 2 of the bill outlines the various tasks and goals to be achieved in relation to progressive supranuclear palsy and other neurodegenerative diseases. These include identifying the number of people diagnosed annually, collecting data on diagnoses and adverse health outcomes, understanding the impact of these diseases, determining the standard of care, researching treatments and therapies, developing a risk surveillance system, and making policy recommendations to improve patient awareness and detection.
Summary of page 3:
Page 3 of the bill o