# US State and County Hospital Data Scraper (Comprehensive)

This notebook scrapes detailed hospital data from HospitalStats.org for all 50 US states and their counties.

## Features
- **State & County Discovery**: Discovers all 50 US states and the counties the site lists for each (about 2,400 in the recorded run)
- **Hospital Listings**: Extracts hospital names, cities, and ER wait times from county pages
- **Detailed Metrics**: Scrapes comprehensive data from each hospital's detail page:
  - Basic Information: Address, phone, hospital type, emergency services
  - Mortality Rates: Overall, heart attack, stroke, heart failure, pneumonia
  - Infection Data: C. Diff and MRSA case counts
  - ER Wait Times: Average time spent in emergency department
  - Patient Ratings: Overall rating, positive/negative feedback

## Usage
1. Run all cells in order
2. Set `MAX_COUNTIES` in the configuration cell to limit the run (for example, 5 for a quick test), or leave it as `None` to scrape every county
3. Set `ENRICH_DATA = True` to also scrape each hospital's detail page (slower, but yields the detailed metrics)
4. Results are saved to `out/us_states_TIMESTAMP/`

In [1]:
# Import required libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import time
import datetime as dt
from pathlib import Path
from typing import Optional, List, Dict
from urllib.parse import urljoin

print("✓ All libraries imported successfully")

✓ All libraries imported successfully


## Configuration

Adjust these settings to control the scraping behavior:

In [3]:
# Configuration Settings
BASE_URL = "https://www.hospitalstats.org"
DELAY = 0.8  # Delay between requests in seconds (be polite!)

# Scraping scope
MAX_COUNTIES = None  # Set to None to scrape all counties, or a number to limit (e.g., 5 for testing)
ENRICH_DATA = True  # Set to False to skip detail page scraping (faster but less data)

# Output directory
OUTPUT_DIR = Path("out")

print("Configuration:")
print(f"  - Base URL: {BASE_URL}")
print(f"  - Delay: {DELAY}s between requests")
print(f"  - Max counties: {MAX_COUNTIES if MAX_COUNTIES else 'ALL (~3000)'}")
print(f"  - Enrich with detail pages: {ENRICH_DATA}")
print(f"  - Output directory: {OUTPUT_DIR}")

Configuration:
  - Base URL: https://www.hospitalstats.org
  - Delay: 0.8s between requests
  - Max counties: ALL (~3000)
  - Enrich with detail pages: True
  - Output directory: out
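
As a rough guide to what `DELAY` implies for a full run, the sketch below estimates the time spent sleeping between requests. The page counts come from the recorded run later in this notebook (50 state pages, 2,400 county pages, 4,088 hospital detail pages). `estimate_sleep_seconds` is a hypothetical helper, not part of the scraper, and it ignores network and parsing time, so the result is only a lower bound.

```python
# Hypothetical helper: lower bound on the time spent in time.sleep() alone.
DELAY = 0.8  # seconds, matching the configuration above


def estimate_sleep_seconds(n_state_pages: int, n_county_pages: int,
                           n_detail_pages: int, delay: float = DELAY) -> float:
    """Total sleep time if every page fetch waits `delay` seconds first."""
    return (n_state_pages + n_county_pages + n_detail_pages) * delay


# Page counts taken from the recorded run in this notebook
seconds = estimate_sleep_seconds(50, 2400, 4088)
print(f"At least {seconds / 3600:.1f} hours of sleeping between requests")
```

With these counts the pauses alone add up to roughly an hour and a half, so a small `MAX_COUNTIES` value is worth using while testing.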


## Helper Functions

These functions handle HTTP requests and HTML parsing:

In [4]:
def fetch_html(url: str, delay: float = DELAY) -> Optional[str]:
    """
    Fetch HTML content from a URL with error handling.
    
    Args:
        url: URL to fetch
        delay: Delay before request (seconds)
    
    Returns:
        HTML content as string, or None if error
    """
    time.sleep(delay)
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException as e:
        print(f"ERROR fetching {url}: {e}")
        return None

def parse_wait_time(text: str) -> Optional[int]:
    """
    Parse wait time text like '2h 15m' into minutes.
    
    Args:
        text: Wait time text (e.g., "2h 15m", "45m", "1h")
    
    Returns:
        Total minutes as integer, or None if parsing fails
    """
    if not text or text.strip() == "":
        return None
    
    total_minutes = 0
    
    # Match hours (e.g., "2h")
    h_match = re.search(r'(\d+)\s*h', text, re.IGNORECASE)
    if h_match:
        total_minutes += int(h_match.group(1)) * 60
    
    # Match minutes (e.g., "15m")
    m_match = re.search(r'(\d+)\s*m', text, re.IGNORECASE)
    if m_match:
        total_minutes += int(m_match.group(1))
    
    return total_minutes if total_minutes > 0 else None

print("✓ Helper functions defined")

✓ Helper functions defined
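
To make the behavior of `parse_wait_time` concrete, the sketch below runs a compact copy of it on a few sample strings. The copy is repeated only so the example runs on its own, and the sample strings are made up for illustration.

```python
import re
from typing import Optional


def parse_wait_time(text: str) -> Optional[int]:
    """Compact copy of the helper above, repeated so this example is self-contained."""
    if not text or not text.strip():
        return None
    hours = re.search(r"(\d+)\s*h", text, re.IGNORECASE)
    minutes = re.search(r"(\d+)\s*m", text, re.IGNORECASE)
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    # A total of zero is treated as "no usable value"
    return total if total > 0 else None


# Made-up inputs covering the common shapes and the edge cases
for sample in ["2h 15m", "0h 45m", "1h 0m", "45m", "", "0h 0m", "n/a"]:
    print(f"{sample!r:>10} -> {parse_wait_time(sample)}")
```

Note that a reported wait of exactly zero minutes comes back as `None`, the same as an unparseable string, so the two cases cannot be told apart downstream.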


## Detail Page Parsing

This function extracts comprehensive data from individual hospital detail pages:

In [5]:
def parse_detail_page(html: str, url: str) -> dict:
    """
    Extract comprehensive metrics from a hospital's detail page:
    - Address, phone, hospital type, emergency services
    - Mortality rates (overall, heart attack, stroke, heart failure, pneumonia)
    - Infection cases (C. Diff, MRSA)
    - Average ER wait time
    - Patient ratings (overall, positive points, negative points)
    """
    from bs4 import NavigableString, Tag
    
    def clean(s):
        """Strip and normalize whitespace"""
        return re.sub(r'\s+', ' ', (s or "").strip())
    
    def to_percent(s):
        """Extract percentage from string"""
        if not s:
            return None
        m = re.search(r'(\d+(?:\.\d+)?)\s*%', str(s))
        return float(m.group(1)) if m else None
    
    def node_text(n) -> str:
        """Return visible text for any BeautifulSoup node"""
        if n is None:
            return ""
        if isinstance(n, NavigableString):
            return str(n)
        if hasattr(n, "get_text"):
            return n.get_text(" ", strip=True)
        return str(n)
    
    def safe_join(parts, sep=" "):
        """Join any list/iterable of nodes/strings safely as text"""
        return sep.join([node_text(p) for p in parts if p is not None and node_text(p)])
    
    def text_after_b(soup, label_regex):
        """Find <b>Label:</b> VALUE in the same parent"""
        for b in soup.find_all("b"):
            if re.search(label_regex, b.get_text(" ", strip=True), re.I):
                parts = []
                for sib in b.next_siblings:
                    if isinstance(sib, Tag) and sib.name == "br":
                        break
                    parts.append(sib)
                txt = clean(safe_join(parts))
                if txt:
                    return txt
        return None
    
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find(["h1", "h2"])
    name = clean(h1.get_text(" ", strip=True)) if h1 else None

    # Address/phone block
    address = city = state = postal = phone = None
    left = soup.select('div[style*="float:left"][style*="width:40%"]')
    if left:
        addr_html = left[0]
        lines = [x for x in addr_html.get_text("\n", strip=True).split("\n") if x]
        if lines:
            address = clean(lines[0])
        if len(lines) >= 2:
            m = re.search(r"(.+?),\s*([A-Z]{2})\s+(\d{5}(?:-\d{4})?)", lines[1])
            if m:
                city, state, postal = clean(m.group(1)), m.group(2), m.group(3)
        b_phone = addr_html.find("b", string=re.compile(r"^\s*Phone\s*:\s*$", re.I))
        if b_phone:
            parts = []
            for sib in b_phone.next_siblings:
                if isinstance(sib, Tag) and sib.name == "br":
                    break
                parts.append(sib)
            maybe_phone = safe_join(parts)
            pm = re.search(r"\(?\d{3}\)?[ -]?\d{3}[ -]?\d{4}", maybe_phone)
            if pm:
                phone = pm.group(0)

    hosp_type = text_after_b(soup, r"Hospital\s*Type")
    emergency_services = text_after_b(soup, r"Emergency\s*Services")
    if emergency_services:
        es = emergency_services.upper()
        emergency_services = "YES" if "YES" in es else ("NO" if "NO" in es else emergency_services)

    # Quality section - Mortality rates
    quality_hdr = soup.find(id="Quality")
    mort_text = mort_pct = mort_dir = None
    ha = st = hf = pn = None
    if quality_hdr:
        span = quality_hdr.find_next("span", class_="bigstat")
        if span:
            mort_text = clean(span.get_text(" ", strip=True))
            m = re.search(r"(\d+(?:\.\d+)?)\s*%", mort_text)
            mort_pct = float(m.group(1)) if m else None
            d = re.search(r"\b(better|worse)\b", mort_text, re.I)
            mort_dir = d.group(1).lower() if d else None
        tbl = quality_hdr.find_next("table")
        if tbl:
            for tr in tbl.find_all("tr"):
                th = tr.find("th")
                tds = tr.find_all("td")
                if not th or not tds:
                    continue
                label = clean(th.get_text(" ", strip=True))
                pct = to_percent(tds[0].get_text(" ", strip=True))
                if re.search(r"Heart Attack", label, re.I):
                    ha = pct
                elif re.search(r"Stroke", label, re.I):
                    st = pct
                elif re.search(r"Heart Failure", label, re.I):
                    hf = pct
                elif re.search(r"Pneumonia", label, re.I):
                    pn = pct

    # Infections
    c_diff = mrsa = None
    inf = soup.find(id="infectious")
    if inf:
        tbl = inf.find_next("table")
        if tbl:
            for tr in tbl.find_all("tr"):
                tds = tr.find_all("td")
                if len(tds) != 2:
                    continue
                label = tds[0].get_text(" ", strip=True)
                val = re.sub(r"[^\d]", "", tds[1].get_text(" ", strip=True)) or None
                val = int(val) if val else None
                if re.search(r"C\.\s*Diff", label, re.I):
                    c_diff = val
                if re.search(r"MRSA", label, re.I):
                    mrsa = val

    # ER wait time
    avg_ed_min = None
    er = soup.find(id="erwait")
    if er:
        span = (er.find_parent() or er).find("span", "bigstat")
        if span:
            avg_ed_min = parse_wait_time(span.get_text(" ", strip=True))

    # Patient ratings
    overall_patient_rating = None
    positive_points = negative_points = None

    pr_hdr = soup.find(id="patientratings")
    if pr_hdr:
        # Overall rating
        span = pr_hdr.find_next("span", class_="bigstat")
        if span:
            overall_patient_rating = clean(span.get_text(" ", strip=True))

        # Positive box
        pos_h3 = soup.find("h3", string=re.compile(r"Positive\s+Patient\s+Ratings", re.I))
        if pos_h3:
            pos_box = pos_h3.find_parent("div")
            if pos_box:
                ul = pos_box.find("ul")
                if ul and ul.find_all("li"):
                    positive_points = "; ".join(
                        clean(li.get_text(" ", strip=True)) for li in ul.find_all("li")
                    )
                else:
                    raw = clean(pos_box.get_text(" ", strip=True))
                    raw = re.sub(r"^\s*Positive\s+Patient\s+Ratings\s*", "", raw, flags=re.I).strip()
                    if re.search(r"No\s+consistently\s+positive\s+ratings", raw, re.I):
                        positive_points = None
                    elif raw:
                        positive_points = raw

        # Negative box
        neg_h3 = soup.find("h3", string=re.compile(r"Negative\s+Patient\s+Ratings", re.I))
        if neg_h3:
            neg_box = neg_h3.find_parent("div")
            if neg_box:
                ul = neg_box.find("ul")
                if ul and ul.find_all("li"):
                    negative_points = "; ".join(
                        clean(li.get_text(" ", strip=True)) for li in ul.find_all("li")
                    )

        # Safety: if positive == negative (selector leak), blank out positive
        if positive_points and negative_points and positive_points == negative_points:
            positive_points = None

    return {
        "detail_url": url,
        "detail_name": name,
        "detail_address": address,
        "detail_city": city,
        "detail_state": state,
        "detail_zip": postal,
        "detail_phone": phone,
        "detail_hospital_type": hosp_type,
        "detail_emergency_services": emergency_services,
        "detail_mortality_overall_text": mort_text,
        "detail_mortality_overall_percent": mort_pct,
        "detail_mortality_overall_direction": mort_dir,
        "detail_mortality_heart_attack_percent": ha,
        "detail_mortality_stroke_percent": st,
        "detail_mortality_heart_failure_percent": hf,
        "detail_mortality_pneumonia_percent": pn,
        "detail_c_diff_cases": c_diff,
        "detail_mrsa_cases": mrsa,
        "detail_avg_time_in_ed_minutes": avg_ed_min,
        "detail_overall_patient_rating": overall_patient_rating,
        "detail_positive_patient_ratings": positive_points,
        "detail_negative_patient_ratings": negative_points,
    }

print("✓ Detail page parser defined")

✓ Detail page parser defined
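
Much of `parse_detail_page` comes down to a handful of regular expressions. The sketch below applies the same patterns to made-up strings so their behavior can be checked without fetching a page. The sample address and phone number are invented, and the real page text may be formatted differently.

```python
import re

# Patterns copied from parse_detail_page
CITY_STATE_ZIP = re.compile(r"(.+?),\s*([A-Z]{2})\s+(\d{5}(?:-\d{4})?)")
PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*%")
PHONE = re.compile(r"\(?\d{3}\)?[ -]?\d{3}[ -]?\d{4}")

# Made-up sample strings for illustration only
address_line = "Springfield, IL 62701-1234"
stat_text = "7.5 % worse than the national average"
phone_text = "Phone: (555) 123-4567"

m = CITY_STATE_ZIP.search(address_line)
city, state, postal = m.group(1), m.group(2), m.group(3)
percent = float(PERCENT.search(stat_text).group(1))
phone = PHONE.search(phone_text).group(0)

print(city, state, postal)
print(percent)
print(phone)
```

The state pattern accepts only two uppercase letters, so an address line that spells out the state name would leave `detail_city`, `detail_state`, and `detail_zip` empty.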


## State and County Discovery

These functions discover all US states and their counties:

In [6]:
# All 50 US state abbreviations
US_ABBR = {
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
    "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
    "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
    "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"
}

def get_united_state_links() -> pd.DataFrame:
    """Discover all 50 US states and their county list pages."""
    base_url = f"{BASE_URL}/ER-Wait-Time/"
    html = fetch_html(base_url, delay=0)
    if not html:
        return pd.DataFrame(columns=["state_abbr", "state_url"])
    
    soup = BeautifulSoup(html, "html.parser")
    
    # Find the section with the state list
    container = None
    for h3 in soup.find_all("h3"):
        if "Browse Emergency Room Stats by State" in h3.get_text(strip=True):
            container = h3.parent
            break
    anchors = container.find_all("a", href=True) if container else soup.find_all("a", href=True)

    rows = []
    for a in anchors:
        href = a["href"]
        # Looks like "IL-Counties.htm"
        if re.fullmatch(r"[A-Z]{2}-Counties\.htm", href):
            abbr = a.get_text(strip=True).upper() or href.split("-")[0]
            if abbr in US_ABBR:
                rows.append({
                    "state_abbr": abbr,
                    "state_url": urljoin(base_url, href)
                })
    return pd.DataFrame(rows).drop_duplicates().sort_values("state_abbr").reset_index(drop=True)


def discover_counties_for_state(state_abbr: str, state_url: str) -> pd.DataFrame:
    """
    For a given state page, return county name + absolute URL.
    Targets anchors like: <a href="Ada-County-ID.htm">Ada</a>
    """
    html = fetch_html(state_url)
    if not html:
        return pd.DataFrame(columns=["state_abbr", "county_name", "county_url"])
    
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find(id="content") or soup
    
    rows = []
    for a in content.find_all("a", href=True):
        href = a["href"]
        # Pattern: "Ada-County-ID.htm"
        if re.search(r"-County-[A-Z]{2}\.htm$", href, re.I):
            county_name = a.get_text(strip=True)
            if county_name:
                # Build absolute URL with /ER-Wait-Time/ prefix
                county_url = urljoin(f"{BASE_URL}/ER-Wait-Time/", href)
                rows.append({
                    "state_abbr": state_abbr,
                    "county_name": county_name,
                    "county_url": county_url
                })
    
    return pd.DataFrame(rows).drop_duplicates().reset_index(drop=True)


def build_united_counties_df() -> pd.DataFrame:
    """Discover all US states and their counties."""
    print("Discovering US states...")
    df_states = get_united_state_links()
    print(f"Found {len(df_states)} states\n")
    
    all_parts = []
    for idx, row in df_states.iterrows():
        abbr = row["state_abbr"]
        state_url = row["state_url"]
        print(f"[{idx+1}/{len(df_states)}] Discovering counties in {abbr}...")
        df_counties = discover_counties_for_state(abbr, state_url)
        print(f"  → Found {len(df_counties)} counties")
        all_parts.append(df_counties)
    
    df_counties = pd.concat(all_parts, ignore_index=True) if all_parts else pd.DataFrame(
        columns=["state_abbr", "county_name", "county_url"]
    )
    
    print(f"\n✓ Total counties discovered: {len(df_counties)}")
    return df_counties

print("✓ State and county discovery functions defined")

✓ State and county discovery functions defined
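
The discovery functions rely on `urljoin` to turn relative links into absolute URLs. The sketch below shows how the base URL's trailing slash changes the result, using href shapes taken from the code comments above. Whether the live site still uses exactly these hrefs is an assumption.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.hospitalstats.org"
INDEX_URL = f"{BASE_URL}/ER-Wait-Time/"  # note the trailing slash

state_href = "IL-Counties.htm"
county_href = "Ada-County-ID.htm"

# The same patterns the discovery functions use to recognize links
is_state_link = re.fullmatch(r"[A-Z]{2}-Counties\.htm", state_href) is not None
is_county_link = re.search(r"-County-[A-Z]{2}\.htm$", county_href, re.I) is not None

# With a trailing slash, the relative link resolves inside /ER-Wait-Time/
inside_dir = urljoin(INDEX_URL, county_href)
# Without a path, the same link resolves at the site root instead
at_root = urljoin(BASE_URL, county_href)

print(is_state_link, is_county_link)
print(inside_dir)
print(at_root)
```

The same rule applies wherever a relative href is joined against `BASE_URL` alone: the result lands at the site root, which is correct only if that is where the target pages live.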


## County Page Parsing

This function extracts hospital listings from county pages:

In [7]:
def parse_county_page(county_url: str) -> pd.DataFrame:
    """
    Parse a county page and extract hospital data.
    Returns DataFrame with columns: hospital_name, city, wait_text, wait_minutes, detail_url, source_url
    """
    html = fetch_html(county_url)
    if not html:
        return pd.DataFrame()
    
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find(id="content") or soup
    
    rows = []
    for tr in content.find_all("tr"):
        tds = tr.find_all("td")
        if len(tds) < 3:
            continue
        
        # Column 0: Hospital name (with link to detail page)
        a = tds[0].find("a", href=True)
        if not a:
            continue
        hospital_name = a.get_text(strip=True)
        detail_href = a["href"]
        detail_url = urljoin(BASE_URL, detail_href)
        
        # Column 1: City
        city = tds[1].get_text(strip=True)
        
        # Column 2: Wait time
        wait_text = tds[2].get_text(strip=True)
        wait_minutes = parse_wait_time(wait_text)
        
        rows.append({
            "hospital_name": hospital_name,
            "city": city,
            "wait_text": wait_text,
            "wait_minutes": wait_minutes,
            "detail_url": detail_url,
            "source_url": county_url
        })
    
    return pd.DataFrame(rows)

print("✓ County page parser defined")

✓ County page parser defined


## Data Enrichment

This function enriches hospital data by visiting detail pages:

In [8]:
def enrich_df_inline(df: pd.DataFrame, delay: float = DELAY) -> pd.DataFrame:
    """
    Enrich a DataFrame by visiting each hospital's detail page.
    Adds detailed metrics columns to the DataFrame.
    """
    if df.empty or "detail_url" not in df.columns:
        return df
    
    detail_cols = [
        "detail_name", "detail_address", "detail_city", "detail_state", "detail_zip",
        "detail_phone", "detail_hospital_type", "detail_emergency_services",
        "detail_mortality_overall_text", "detail_mortality_overall_percent",
        "detail_mortality_overall_direction", "detail_mortality_heart_attack_percent",
        "detail_mortality_stroke_percent", "detail_mortality_heart_failure_percent",
        "detail_mortality_pneumonia_percent", "detail_c_diff_cases", "detail_mrsa_cases",
        "detail_avg_time_in_ed_minutes", "detail_overall_patient_rating",
        "detail_positive_patient_ratings", "detail_negative_patient_ratings"
    ]
    
    # Initialize columns
    for col in detail_cols:
        df[col] = None
    
    print(f"  Enriching {len(df)} hospitals with detail page data...")
    
    for idx, row in df.iterrows():
        detail_url = row["detail_url"]
        if pd.isna(detail_url):
            continue
        
        html = fetch_html(detail_url, delay=delay)
        if not html:
            continue
        
        try:
            detail_data = parse_detail_page(html, detail_url)
            for col in detail_cols:
                if col in detail_data:
                    df.at[idx, col] = detail_data[col]
        except Exception as e:
            print(f"    ERROR parsing detail page {detail_url}: {e}")
    
    # Add scrape timestamp
    df["scrape_ts"] = dt.datetime.now(dt.timezone.utc).isoformat()
    
    return df

print("✓ Data enrichment function defined")

✓ Data enrichment function defined
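
Enrichment fetches one page per hospital, so rerunning the notebook repeats every request. One possible way to avoid that is a small on-disk cache keyed by URL, sketched below. `cached_fetch` and its cache layout are hypothetical and not part of this notebook. The demonstration uses a stand-in fetcher rather than a real HTTP call.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, List, Optional


def cached_fetch(url: str, fetch: Callable[[str], Optional[str]],
                 cache_dir: Path) -> Optional[str]:
    """Return cached HTML for `url` if present; otherwise call `fetch` and store it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.html"
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    if html is not None:  # failures are not cached, so they are retried next time
        path.write_text(html, encoding="utf-8")
    return html


# Stand-in fetcher that records how often it is called
calls: List[str] = []


def fake_fetch(url: str) -> Optional[str]:
    calls.append(url)
    return "<html>example</html>"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("https://example.org/a", fake_fetch, Path(tmp))
    second = cached_fetch("https://example.org/a", fake_fetch, Path(tmp))

print(len(calls))  # the second call is served from disk
```

In this notebook, `fetch_html` could be passed as the `fetch` argument. Cached pages go stale, so the cache directory would need clearing before a fresh scrape.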


## Main Scraping Function

This function orchestrates the scraping of all counties:

In [9]:
def scrape_all_counties(df_counties: pd.DataFrame, max_counties: Optional[int] = None, enrich: bool = True) -> pd.DataFrame:
    """
    Scrape hospital data from all county pages.
    
    Args:
        df_counties: DataFrame with county information
        max_counties: Limit to N counties (for testing), or None for all
        enrich: Whether to enrich with detail page data
    
    Returns:
        DataFrame with all hospital data
    """
    if max_counties:
        df_counties = df_counties.head(max_counties)
    
    all_hospitals = []
    total = len(df_counties)
    
    print(f"\nScraping {total} counties...")
    print("=" * 60)
    
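    for pos, (_, row) in enumerate(df_counties.iterrows(), start=1):
        state_abbr = row["state_abbr"]
        county_name = row["county_name"]
        county_url = row["county_url"]
        
        # Use a running position for the progress counter so it stays correct
        # even if df_counties does not have a default 0..N-1 index
        print(f"[{pos}/{total}] {state_abbr} - {county_name}...", end=" ")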
        
        df_hospitals = parse_county_page(county_url)
        
        if df_hospitals.empty:
            print("No hospitals found")
            continue
        
        # Add state and county info
        df_hospitals["state_abbr"] = state_abbr
        df_hospitals["county_name"] = county_name
        
        # Enrich with detail page data if requested
        if enrich:
            df_hospitals = enrich_df_inline(df_hospitals)
        
        print(f"Found {len(df_hospitals)} hospitals")
        all_hospitals.append(df_hospitals)
    
    print("=" * 60)
    
    if not all_hospitals:
        print("No hospitals found!")
        return pd.DataFrame()
    
    df_all = pd.concat(all_hospitals, ignore_index=True)
    print(f"\n✓ Total hospitals scraped: {len(df_all)}")
    
    return df_all

print("✓ Main scraping function defined")

✓ Main scraping function defined
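
`scrape_all_counties` keeps everything in memory until the loop finishes, so an interruption partway through a long run loses the progress made so far. The sketch below shows one possible way to record finished counties in a plain text file and skip them on the next run. The helper names and the `done_counties.txt` file are hypothetical. Per-county results would also have to be written out as they are produced, which this sketch does not cover.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List, Set


def load_done(log_path: Path) -> Set[str]:
    """Read the set of county URLs already processed (empty if no log exists)."""
    if not log_path.exists():
        return set()
    lines = log_path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def mark_done(log_path: Path, county_url: str) -> None:
    """Append one finished county URL to the log."""
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(county_url + "\n")


def pending(county_urls: Iterable[str], done: Set[str]) -> List[str]:
    """Keep the original order, dropping URLs that are already done."""
    return [u for u in county_urls if u not in done]


# Demonstration with placeholder URLs
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "done_counties.txt"
    urls = ["https://example.org/A-County-XX.htm",
            "https://example.org/B-County-XX.htm"]
    mark_done(log, urls[0])  # pretend the first county finished
    remaining = pending(urls, load_done(log))

print(remaining)
```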


## Export and Summary Functions

These functions handle data export and summary statistics:

In [10]:
def export_results(df_hospitals: pd.DataFrame, outdir: Path) -> pd.DataFrame:
    """
    Export hospital data to CSV with specific column order.
    
    Args:
        df_hospitals: DataFrame with hospital data
        outdir: Output directory path
    
    Returns:
        DataFrame with reordered columns
    """
    # Define column order
    base_cols = [
        "state_abbr", "county_name", "hospital_name", "city",
        "wait_text", "wait_minutes", "detail_url"
    ]
    
    detail_cols = [
        "detail_name", "detail_address", "detail_city", "detail_state", "detail_zip",
        "detail_phone", "detail_hospital_type", "detail_emergency_services",
        "detail_mortality_overall_text", "detail_mortality_overall_percent",
        "detail_mortality_overall_direction", "detail_mortality_heart_attack_percent",
        "detail_mortality_stroke_percent", "detail_mortality_heart_failure_percent",
        "detail_mortality_pneumonia_percent", "detail_c_diff_cases", "detail_mrsa_cases",
        "detail_avg_time_in_ed_minutes", "detail_overall_patient_rating",
        "detail_positive_patient_ratings", "detail_negative_patient_ratings"
    ]
    
    # Reorder columns
    ordered_cols = base_cols + detail_cols + ["source_url", "scrape_ts"]
    existing_cols = [c for c in ordered_cols if c in df_hospitals.columns]
    df_hospitals = df_hospitals[existing_cols]
    
    # Export
    csv_path = outdir / "us_hospitals_data_enriched.csv"
    df_hospitals.to_csv(csv_path, index=False, encoding="utf-8")
    print(f"\n✓ Data exported to: {csv_path}")
    
    return df_hospitals


def print_summary(df_hospitals: pd.DataFrame, outdir: Path):
    """Print summary statistics about the scraped data."""
    print("\n" + "=" * 60)
    print("SUMMARY STATISTICS")
    print("=" * 60)
    
    print(f"Total hospitals: {len(df_hospitals)}")
    print(f"States covered: {df_hospitals['state_abbr'].nunique()}")
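    # Note: nunique() on county_name counts distinct names, so same-named
    # counties in different states (e.g. "Washington") are counted once
    print(f"Counties covered: {df_hospitals['county_name'].nunique()}")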
    
    if "wait_minutes" in df_hospitals.columns:
        avg_wait = df_hospitals["wait_minutes"].mean()
        if not pd.isna(avg_wait):
            print(f"Average ER wait time: {avg_wait:.1f} minutes")
    
    if "detail_mortality_overall_percent" in df_hospitals.columns:
        enriched = df_hospitals["detail_mortality_overall_percent"].notna().sum()
        print(f"Hospitals with detailed metrics: {enriched}")
    
    print(f"\nOutput directory: {outdir}")
    print("=" * 60)

print("✓ Export and summary functions defined")

✓ Export and summary functions defined
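
`print_summary` reports counties via `nunique()` on `county_name`, which counts distinct names rather than distinct counties. The plain-Python sketch below, on made-up rows, shows how the two counts differ when a name is shared across states.

```python
# Made-up rows: (state_abbr, county_name), one per hospital
rows = [
    ("AL", "Washington"),
    ("GA", "Washington"),
    ("GA", "Fulton"),
    ("GA", "Fulton"),
]

# Distinct names only: "Washington" is counted once although it occurs in two states
distinct_names = len({county for _, county in rows})

# Distinct (state, county) pairs: each real county is counted once
distinct_counties = len(set(rows))

print(distinct_names, distinct_counties)
```

On the notebook's DataFrame, the pair-based count could be obtained with `len(df_hospitals[["state_abbr", "county_name"]].drop_duplicates())`.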


---

## RUN THE SCRAPER

Execute the cells below to run the comprehensive scraper:

### Step 1: Discover Counties

This will discover all US states and their counties:

In [11]:
# Step 1: Discover all counties
df_counties = build_united_counties_df()

# Preview the counties
print("\nFirst 10 counties:")
print(df_counties.head(10))

Discovering US states...
Found 50 states

[1/50] Discovering counties in AK...
  → Found 12 counties
[2/50] Discovering counties in AL...
  → Found 58 counties
[3/50] Discovering counties in AR...
  → Found 53 counties
[4/50] Discovering counties in AZ...
  → Found 14 counties
[5/50] Discovering counties in CA...
  → Found 56 counties
[6/50] Discovering counties in CO...
  → Found 40 counties
[7/50] Discovering counties in CT...
  → Found 8 counties
[8/50] Discovering counties in DE...
  → Found 3 counties
[9/50] Discovering counties in FL...
  → Found 56 counties
[10/50] Discovering counties in GA...
  → Found 106 counties
[11/50] Discovering counties in HI...
  → Found 4 counties
[12/50] Discovering counties in IA...
  → Found 88 counties
[13/50] Discovering counties in ID...
  → Found 28 counties
[14/50] Discovering counties in IL...
  → Found 75 counties
[15/50] Discovering counties in IN...
  → Found 75 counties
[16/50] Discovering counties in KS...
  → Found 80 counties
[17/50] D

### Step 2: Setup Output Directory

Create a timestamped output directory:

In [12]:
# Step 2: Setup output directory
stamp = dt.datetime.now(dt.timezone.utc).strftime("%Y%m%d_%H%M%S")
outdir = OUTPUT_DIR / f"us_states_{stamp}"
outdir.mkdir(parents=True, exist_ok=True)

print(f"Output directory: {outdir}")

# Save counties list
counties_csv = outdir / "counties_list.csv"
df_counties.to_csv(counties_csv, index=False)
print(f"Counties list saved to: {counties_csv}")

Output directory: out\us_states_20251018_165814
Counties list saved to: out\us_states_20251018_165814\counties_list.csv


### Step 3: Scrape Hospital Data

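Scrape hospital data from every county, or from the first `MAX_COUNTIES` counties if a limit is set:

**Note:** A full run is slow. With `MAX_COUNTIES = None` and `ENRICH_DATA = True`, every county page and every hospital detail page is fetched, with a pause of `DELAY` seconds before each request.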

In [13]:
# Step 3: Scrape hospital data
df_hospitals = scrape_all_counties(
    df_counties,
    max_counties=MAX_COUNTIES,
    enrich=ENRICH_DATA
)

# Preview the data
print("\nFirst 5 hospitals:")
print(df_hospitals.head())


Scraping 2400 counties...
[1/2400] AK - Anchorage...   Enriching 5 hospitals with detail page data...
Found 5 hospitals
[2/2400] AK - Bethel... No hospitals found
[3/2400] AK - Dillingham... No hospitals found
[4/2400] AK - Fairbanks North Star...   Enriching 2 hospitals with detail page data...
Found 2 hospitals
[5/2400] AK - Juneau...   Enriching 1 hospitals with detail page data...
Found 1 hospitals
[6/2400] AK - Kenai Peninsula...   Enriching 3 hospitals with detail page data...
Found 3 hospitals
[7/2400] AK - Ketchikan Gateway...   Enriching 1 hospitals with detail page data...
Found 1 hospitals
[8/2400] AK - Kodiak Island... No hospitals found
[9/2400] AK - Matanuska Susitna...   Enriching 1 hospitals with detail page data...
Found 1 hospitals
[10/2400] AK - Nome... No hospitals found
[11/2400] AK - Sitka... No hospitals found
[12/2400] AK - Valdez Cordova...   Enriching 2 hospitals with detail page data...
Found 2 hospitals
[13/2400] AL - Autauga...   Enriching 1 hospitals with

### Step 4: Export Results

Export the scraped data to CSV:

In [14]:
# Step 4: Export results
df_hospitals = export_results(df_hospitals, outdir)


✓ Data exported to: out\us_states_20251018_165814\us_hospitals_data_enriched.csv


### Step 5: Print Summary

Display summary statistics:

In [15]:
# Step 5: Print summary
print_summary(df_hospitals, outdir)


SUMMARY STATISTICS
Total hospitals: 4088
States covered: 50
Counties covered: 1326
Average ER wait time: 162.1 minutes
Hospitals with detailed metrics: 3173

Output directory: out\us_states_20251018_165814


---

## Data Analysis

Explore the scraped data with these analysis cells:

### View Data Info

Check the structure and completeness of the data:

In [16]:
# View data structure
print("Data shape:", df_hospitals.shape)
print("\nColumn names:")
print(df_hospitals.columns.tolist())
print("\nData types:")
print(df_hospitals.dtypes)
print("\nMissing values:")
print(df_hospitals.isnull().sum())

Data shape: (4088, 30)

Column names:
['state_abbr', 'county_name', 'hospital_name', 'city', 'wait_text', 'wait_minutes', 'detail_url', 'detail_name', 'detail_address', 'detail_city', 'detail_state', 'detail_zip', 'detail_phone', 'detail_hospital_type', 'detail_emergency_services', 'detail_mortality_overall_text', 'detail_mortality_overall_percent', 'detail_mortality_overall_direction', 'detail_mortality_heart_attack_percent', 'detail_mortality_stroke_percent', 'detail_mortality_heart_failure_percent', 'detail_mortality_pneumonia_percent', 'detail_c_diff_cases', 'detail_mrsa_cases', 'detail_avg_time_in_ed_minutes', 'detail_overall_patient_rating', 'detail_positive_patient_ratings', 'detail_negative_patient_ratings', 'source_url', 'scrape_ts']

Data types:
state_abbr                                 object
county_name                                object
hospital_name                              object
city                                       object
wait_text                         

### 10 Hospitals with the Shortest Wait Times

Find hospitals with the shortest ER wait times:

In [17]:
# Top 10 hospitals with shortest wait times
if "wait_minutes" in df_hospitals.columns:
    top_10 = df_hospitals.nsmallest(10, "wait_minutes")[
        ["hospital_name", "city", "state_abbr", "wait_text", "wait_minutes"]
    ]
    print("Top 10 Hospitals with Shortest Wait Times:")
    print(top_10.to_string(index=False))
else:
    print("Wait time data not available")

Top 10 Hospitals with Shortest Wait Times:
                    hospital_name          city state_abbr wait_text  wait_minutes
 Lady Of The Sea General Hospital       Cut Off         LA    0h 45m          45.0
                Hopedale Hospital      Hopedale         IL    0h 50m          50.0
         Big Sandy Medical Center     Big Sandy         MT    0h 53m          53.0
Mitchell County Hospital District Colorado City         TX    0h 54m          54.0
Samuel Mahelona Memorial Hospital         Kapaa         HI    0h 56m          56.0
        Richardson Medical Center      Rayville         LA    0h 58m          58.0
          Keller Ach (West Point)    West Point         NY     1h 0m          60.0
    Mercy Hospital Tishomingo Inc    Tishomingo         OK     1h 0m          60.0
      Mccone County Health Center        Circle         MT     1h 2m          62.0
                Sanford Hillsboro     Hillsboro         ND     1h 2m          62.0


### State-by-State Comparison

Compare average wait times across states:

In [18]:
# State-by-state comparison
if "wait_minutes" in df_hospitals.columns and "state_abbr" in df_hospitals.columns:
    state_stats = df_hospitals.groupby("state_abbr")["wait_minutes"].agg(
        count="count",
        avg_wait="mean",
        min_wait="min",
        max_wait="max",
    ).round(1).sort_values("avg_wait")
    
    print("State-by-State Wait Time Statistics:")
    print(state_stats.to_string())
else:
    print("State comparison data not available")

State-by-State Wait Time Statistics:
            count  avg_wait  min_wait  max_wait
state_abbr                                     
ND             42     111.3      62.0     243.0
NE             76     118.9      73.0     260.0
SD             31     121.3      79.0     222.0
MT             45     123.3      53.0     215.0
KS             98     123.6      71.0     316.0
OK             95     124.3      60.0     259.0
HI             19     126.8      56.0     219.0
IA             98     127.4      74.0     228.0
WY             19     130.7      75.0     200.0
UT             44     134.7      89.0     255.0
MN            106     136.5      73.0     264.0
MS             67     137.3      71.0     587.0
AR             68     137.5      71.0     300.0
WI            117     138.8      74.0     259.0
LA             64     140.5      45.0     316.0
ID             32     140.9      99.0     224.0
CO             69     143.4      86.0     233.0
AK             14     145.5     106.0     192.0
NV 

### Mortality Rate Analysis

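Summarize the overall mortality figure scraped from the detail pages (available only when `ENRICH_DATA` is enabled). The percentage is stored without its `better`/`worse` qualifier, which the parser keeps separately in `detail_mortality_overall_direction`: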

In [20]:
# Mortality rate analysis
if "detail_mortality_overall_percent" in df_hospitals.columns:
    mortality_data = df_hospitals[df_hospitals["detail_mortality_overall_percent"].notna()]
    
    if not mortality_data.empty:
        print(f"Hospitals with mortality data: {len(mortality_data)}")
        print("\nMortality Rate Statistics:")
        print(f"  Average: {mortality_data['detail_mortality_overall_percent'].mean():.2f}%")
        print(f"  Median: {mortality_data['detail_mortality_overall_percent'].median():.2f}%")
        print(f"  Min: {mortality_data['detail_mortality_overall_percent'].min():.2f}%")
        print(f"  Max: {mortality_data['detail_mortality_overall_percent'].max():.2f}%")
    else:
        print("No mortality data available")
else:
    print("Mortality data not available (enrichment may be disabled)")

Hospitals with mortality data: 3173

Mortality Rate Statistics:
  Average: 8.44%
  Median: 7.00%
  Min: 0.00%
  Max: 69.00%
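
The parser stores the overall percentage and its `better`/`worse` direction in separate columns, and the statistics above use only the unsigned percentage. If the headline figure is a relative difference rather than an absolute rate (the direction extraction suggests this, but the notebook does not state it), a signed value may be easier to compare. The sketch below is one way to combine the two. `signed_percent`, the sign convention, and the sample strings are all assumptions made for illustration.

```python
import re
from typing import Optional


def signed_percent(text: Optional[str]) -> Optional[float]:
    """Return +x for 'x% worse', -x for 'x% better', None if either part is missing."""
    text = text or ""
    pct = re.search(r"(\d+(?:\.\d+)?)\s*%", text)
    direction = re.search(r"\b(better|worse)\b", text, re.I)
    if not pct or not direction:
        return None
    value = float(pct.group(1))
    # Assumed convention: positive means worse, negative means better
    return value if direction.group(1).lower() == "worse" else -value


# Made-up headline strings; the real page wording may differ
samples = [
    "12% better than the national average",
    "7.5% Worse than the national average",
    "No data available",
]
for s in samples:
    print(f"{s!r} -> {signed_percent(s)}")
```

In this notebook, the function could be applied to the `detail_mortality_overall_text` column before computing the mean and median.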


---

## Done! 🎉

Your comprehensive hospital data has been scraped and saved.

### Next Steps:
1. Check the output directory for CSV files
2. Use the CSV in the app
3. Adjust `MAX_COUNTIES` and `ENRICH_DATA` to rerun the scraper with a different scope

### Files Generated:
- `counties_list.csv` - List of all discovered counties
- `us_hospitals_data_enriched.csv` - Complete hospital data with all metrics