# Selenium sample

## Prerequisites
- Before running this code from a local working folder, set up a Python environment as follows.
    - Install uv (https://docs.astral.sh/uv/getting-started/installation/#standalone-installer)
    - Move to the repository folder using Terminal (Mac) or PowerShell (Windows)
        - ```cd #path-to-your-repository (GraSPP-25S-climatechange)#```
    - Set up or update your Python environment by running ```uv sync```

## How to use
- For each webpage you want to scrape, set the parameters marked with ```### CHANGE THIS###``` in the code. The trickiest part is specifying the CSS selector that locates the text on the page. For help, refer to the page below or ask an LLM.
    - https://www.tutorialspoint.com/beautiful_soup/beautiful_soup_select_method.htm
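
To illustrate how a selector picks out elements, here is a minimal sketch using BeautifulSoup's `select` on a small HTML fragment. The fragment (`SAMPLE_HTML`) is made up for illustration and is not the real page; only the two selector strings are taken from the code in this notebook. Inspect the actual page in your browser's developer tools before choosing a selector.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, invented for this example; real pages differ.
SAMPLE_HTML = """
<div id="content">
  <article>
    <h2><a href="/briefing-room/statement-one/">Statement one</a></h2>
    <section><div><div><p>Net-zero targets were announced.</p></div></div></section>
  </article>
  <aside><h2>Related</h2><a href="/about/">About</a></aside>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "h2 > a" matches only <a> elements that are direct children of an <h2>,
# so the link inside <aside> (a sibling of its <h2>) is not selected.
links = [a.get("href") for a in soup.select("h2 > a")]

# A chain of child combinators pins down one specific container.
body_blocks = soup.select("#content > article > section > div > div")
body_text = " ".join(block.get_text(strip=True) for block in body_blocks)

print(links)
print(body_text)
```

If `select` returns an empty list on the real page, the selector is too specific or the content is loaded later by JavaScript; loosening the selector step by step is usually the quickest way to find out which.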

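## Known issues
- The code reads only the first page of the news list. It needs to read all pages (up to 1,247 pages in the Biden White House case).
- Need to check whether keywords are matched in a case-sensitive way. The current code lowercases both the article text and the keywords, so matching is case-insensitive and substring-based ("nation" also matches "national").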
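
The two known issues above could be approached along the following lines. This is a sketch under stated assumptions, not a tested fix: `listing_page_urls` and `count_keyword` are hypothetical helper names, and the `page/<n>/` URL pattern is an assumption that must be checked against the real site before use.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://bidenwhitehouse.archives.gov/briefing-room/"


def listing_page_urls(base_url, last_page):
    """Build list-page URLs.

    ASSUMPTION: pages 2..N are served at <base_url>page/<n>/.
    Verify this pattern in the browser before relying on it.
    """
    urls = [base_url]
    for n in range(2, last_page + 1):
        urls.append(urljoin(base_url, f"page/{n}/"))
    return urls


def count_keyword(text, keyword, ignore_case=True, whole_word=False):
    """Count non-overlapping occurrences of keyword in text.

    ignore_case=True mirrors the lowercase comparison used in this notebook;
    whole_word=True additionally requires word boundaries around the keyword.
    """
    pattern = re.escape(keyword)
    if whole_word:
        pattern = rf"\b{pattern}\b"
    flags = re.IGNORECASE if ignore_case else 0
    return len(re.findall(pattern, text, flags))


sample = "Our nation and national GHG goals; ghg totals."
print(listing_page_urls(BASE_URL, 3))
print(count_keyword(sample, "nation"))                   # substring match
print(count_keyword(sample, "nation", whole_word=True))  # whole words only
print(count_keyword(sample, "GHG", ignore_case=False))   # case-sensitive
```

Each generated list-page URL would then go through the same link-extraction step as the first page. Whether whole-word matching is appropriate depends on the keyword: it avoids counting "national" for "nation", but it would also skip plural forms such as "nations".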

In [12]:
# If you use Google Colab, run this instead of the uv setup above
# !pip install selenium

In [28]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
from bs4 import BeautifulSoup

# Set keywords to search
### CHANGE THIS###
KEYWORDS = ["Greenhouse Gas", "GHG", "Net-zero", "Carbon neutral", "nation"]
BASE_URL = "https://bidenwhitehouse.archives.gov/briefing-room/"

# Setup Chrome options
options = Options()
options.add_argument("--headless")  # Run headless for efficiency
options.add_argument("--disable-gpu") # Recommended for headless on some systems
options.add_argument("--no-sandbox") # Bypass OS security model, necessary for some environments

driver = webdriver.Chrome(options=options)

# Load the main briefing room page
print(f"Loading base URL: {BASE_URL}")
driver.get(BASE_URL)

# Create a reusable explicit wait; it is used below when each article page loads
wait = WebDriverWait(driver, 20)  # Wait up to 20 seconds before timing out

# Scroll and load more articles
def scroll_to_load(max_scrolls=10):
    print("Scrolling to load more articles...")
    last_height = driver.execute_script("return document.body.scrollHeight")
    for i in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(3)  # Increased sleep to ensure content loads
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            print(f"Reached end of scrollable content after {i+1} scrolls.")
            break
        last_height = new_height
    print("Finished scrolling.")

scroll_to_load()

# Parse the page source
soup = BeautifulSoup(driver.page_source, "html.parser")

# Find article links
# On the list page, each article title is an <a> that is a direct child of an <h2>.
# Other sites may use a different structure, for example
# div.briefing-room__content-item > a, so inspect the target page's HTML
# and update the selector if no links are found.
### CHANGE THIS###
articles = soup.select("h2 > a")

# Keep only site-relative links under /briefing-room/ (skips anchors and other sections)
article_links = []
for a in articles:
    href = a.get('href')
    # Store the path relative to BASE_URL; BASE_URL is prepended again when visiting
    ### CHANGE THIS###
    if href and href.startswith("/briefing-room/"):
        # Strip only the leading prefix, not later occurrences in the path
        article_links.append(href.replace("/briefing-room/", "", 1))

# Remove duplicates while keeping the order in which links appear on the page
article_links = list(dict.fromkeys(article_links))
print(f"Found {len(article_links)} unique articles.")

# Dictionary to store keyword counts
keyword_counts = {kw: 0 for kw in KEYWORDS}

# Visit each article and search for keywords
for i, link in enumerate(article_links):
    print(f"Processing article {i+1}/{len(article_links)}")
    try:
        link = BASE_URL + link
        driver.get(link)
        # Wait for the main article content to be present
        # This might be an <article> tag, or a specific div containing the article text
        ### CHANGE THIS###
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.container"))) # Common for article bodies
        article_html = driver.page_source
        article_soup = BeautifulSoup(article_html, "html.parser")

        # Extract text from the main content area of the article
        ### CHANGE THIS###
        article_content_div = article_soup.select("#content > article > section > div > div")
        if article_content_div:
            for content_div in article_content_div:
                article_text = content_div.get_text()
                print(article_text)  # Debug output: prints the full article text
                for kw in KEYWORDS:
                    keyword_counts[kw] += article_text.lower().count(kw.lower())
            print(f"Scanned: {link}")
        else:
            print(f"Could not find main content for: {link}")

    except (TimeoutException, NoSuchElementException) as e:
        print(f"Failed to load or parse {link} due to: {e}")
        continue
    except Exception as e:
        print(f"An unexpected error occurred for {link}: {e}")
        continue


# Close the driver
driver.quit()

# Print final keyword counts
print("\nKeyword Occurrences:")
for kw, count in keyword_counts.items():
    print(f"{kw}: {count}")

Loading base URL: https://bidenwhitehouse.archives.gov/briefing-room/
Scrolling to load more articles...
Reached end of scrollable content after 1 scrolls.
Finished scrolling.
Found 10 unique articles.
Processing article 1/10

Our nation relies on dedicated, selfless public servants every day. They are the lifeblood of our democracy.
Yet alarmingly, public servants have been subjected to ongoing threats and intimidation for faithfully discharging their duties.
In certain cases, some have even been threatened with criminal prosecutions, including General Mark A. Milley, Dr. Anthony S. Fauci, and the members and staff of the Select Committee to Investigate the January 6th Attack on the United States Capitol. These public servants have served our nation with honor and distinction and do not deserve to be the targets of unjustified and politically motivated prosecutions.
General Milley served our nation for more than 40 years, serving in multiple command and leadership posts and deploying 