# Socioeconomic and Demographic Drivers of Ambulatory Surgery Usage
### HCUP NASS 2020 – Reproducible Pipeline (Python + R)

**Author:** Seena Khosravi, MD  
**LLMs Utilized:** Claude Sonnet 4, Opus 4; ChatGPT 4o, o4; Deepseek 3.1; Gemini 2.5 Pro  
**Last Updated:** September 13, 2025  

**Data Source:**  
Department of Health & Human Services (HHS)  
Agency for Healthcare Research and Quality (AHRQ)  
Healthcare Cost and Utilization Project (HCUP)  
National Ambulatory Surgical Sample (NASS)  
Year: 2020

---


## Overview
This notebook provides a reproducible analysis pipeline for examining socioeconomic and demographic factors influencing ambulatory surgery usage patterns. The analysis combines Python for data processing and R for statistical modeling.

### Data Usage Agreement
**DUA-Compliant Online Implementation:** This notebook uses a smaller, simulated dataset with the same structure as the file created by [Raw_NASS_Processing.R](https://github.com/SeenaKhosravi/NASS/blob/a7764ce80be8a82fc449831821c27d957176c410/Raw%20NASS%20%20Processing.R). The method used to produce the simulated dataset is in [Generate_Simulated_NASS.R](https://github.com/SeenaKhosravi/NASS/blob/161bf2b5c149da9654c0e887655b361fa2176db0/Generate_Simulated_NASS.R). If you have signed the DUA and purchased the data from HCUP, this notebook can run on the full dataset loaded from your local or cloud storage.

[Please see the DUA Agreement here.](https://hcup-us.ahrq.gov/team/NationwideDUA.jsp)

### Key Features
- **Multi-platform:** Runs on Jupyter in a local environment, on a server, on a cloud VM instance, or on a platform as a service.
- **Flexible Data Storage:** GitHub (simulated, static, open access), Google Drive, Google Cloud Storage, or local file
- **Reproducible:** All dependencies and environment setup included; assumes new, unmodified Colab/Vertex instances
- **Scalable:** Handles both the simulated dataset (0.2 GB, 139k rows) and the full dataset (12 GB, 7.8M rows), with scalable cloud options.

---

## Design Notes

### Architecture
- **Python is primary; R runs via the rpy2 Python extension**
- **Python cells** handle "plumbing" (file I/O, environment setup, rpy2 configuration, data previews)
- **R cells** (prefixed by `%%R`) perform statistical analysis: survey weights, Census lookups, multilevel models, plots, classifiers, etc.

### Data Sources
- **Default:** Simulated dataset (1GB) from GitHub releases
- **Local:** Switch to locally stored files via configuration
- **Drive:** Google Drive (only available in Colab)
- **Cloud:** Google Cloud Storage support for large datasets


### Environment Support
- Local (JupyterLab w/ Python 3.11.5 kernel)
- Jupyter Server (may require some configuring depending on your implementation)
- Google Colab (Pro recommended, high-RAM option)
- Vertex AI Workbench (JupyterLab 3, Python 3 kernel) (used for full analysis)

---

# Setup

---
## 1. Configuration

Configure all settings here before running: data sources, debugging options, and file paths. The default is the simulated dataset.
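A typo in `DATA_SOURCE` only surfaces later, when the loader runs. The sketch below shows one way to fail fast right after configuration. `validate_config` and `VALID_SOURCES` are hypothetical names used for illustration and are not part of the notebook.

```python
# Hypothetical helper (not part of the notebook): reject a bad setting early.
VALID_SOURCES = ("github", "local", "gcs", "drive")

def validate_config(data_source, is_colab=False):
    """Return the normalized data source or raise ValueError."""
    source = data_source.strip().lower()
    if source not in VALID_SOURCES:
        raise ValueError(
            f"Invalid DATA_SOURCE {data_source!r}; expected one of {VALID_SOURCES}"
        )
    # The notebook only supports Drive loading inside Colab
    if source == "drive" and not is_colab:
        raise ValueError("DATA_SOURCE='drive' is only supported in Google Colab")
    return source

print(validate_config("GitHub"))
```

Normalizing case and whitespace is a design choice in this sketch; the notebook itself compares `DATA_SOURCE` exactly.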

In [None]:
# ==================== CONFIGURATION ====================
# Data Source Options
DATA_SOURCE = "github"      # Options: "github", "local", "gcs", "drive"
VERBOSE_PRINTS = True       # False → suppress debug output

# GitHub source (default - simulated data)
GITHUB_URL = "https://github.com/SeenaKhosravi/NASS/releases/download/v1.0.0/nass_2020_simulated.csv"

# Local file options
LOCAL_FILENAME = "nass_2020_local.csv"

# Google Cloud Storage options
GCS_BUCKET = "nass_2020"
GCS_BLOB = "nass_2020_all.csv"
GCS_SERVICE_ACCOUNT_KEY = "/path/to/service-account-key.json"  # Optional

# Google Drive options (for Colab)
DRIVE_PATH = "/content/drive/MyDrive/NASS/nass_2020_full.csv"
# ======================================================

print("✓ Configuration loaded")
print(f"  Data source: {DATA_SOURCE}")
print(f"  Verbose mode: {VERBOSE_PRINTS}")

---
## 2. Environment Setup & Package Installation

Detect the environment and install Python packages, via Conda if present, with pip as a fallback. In Colab, mount Google Drive when `DATA_SOURCE` is `"drive"`.

Note: On a first run in a Vertex VM instance, R is likely not installed yet. Skip the following cell, run the R installation cell after it, and then return to this one.
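The detection step relies on two markers used by this notebook: the `COLAB_GPU` environment variable (or a loaded `google.colab` module) for Colab, and `DL_ANACONDA_HOME` for Vertex AI. These are heuristics rather than documented guarantees. A minimal sketch of the same logic as a pure function, which makes it easy to check without being on either platform (`detect_environment` is a hypothetical name):

```python
import os
import sys

def detect_environment(environ=None, modules=None):
    """Classify the runtime from environment variables and loaded modules."""
    environ = os.environ if environ is None else environ
    modules = sys.modules if modules is None else modules
    # Colab is checked first, mirroring the order used in the notebook
    if "COLAB_GPU" in environ or "google.colab" in modules:
        return "Google Colab"
    if "DL_ANACONDA_HOME" in environ:
        return "Vertex AI"
    return "Local/Jupyter"

print(detect_environment())
```

Passing `environ` and `modules` explicitly lets the function be exercised with plain dictionaries.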

In [None]:
import os
import sys
import subprocess
from pathlib import Path

class EnvironmentManager:
    def __init__(self):
        self.detect_environment()
        self.setup_packages()

    def detect_environment(self):
        """Detect runtime environment"""
        self.is_colab = 'COLAB_GPU' in os.environ or 'google.colab' in sys.modules
        self.is_vertex = 'DL_ANACONDA_HOME' in os.environ

        if self.is_colab:
            self.env_type = "Google Colab"
        elif self.is_vertex:
            self.env_type = "Vertex AI"
        else:
            self.env_type = "Local/Jupyter"

        print(f"Environment: {self.env_type}")

    def check_conda_available(self):
        """Check if conda is available"""
        try:
            subprocess.check_call(['conda', '--version'],
                                stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            return True
        except (subprocess.CalledProcessError, FileNotFoundError):
            return False

    def install_package(self, package, conda_name=None):
        """Smart package installation with fallback"""
        try:
            __import__(package)
            return True
        except ImportError:
            print(f"Installing {package}...")

            # Try conda first if available and not in Colab
            if conda_name and not self.is_colab and self.check_conda_available():
                try:
                    subprocess.check_call(['conda', 'install', '-y', conda_name],
                                        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
                    return True
                except subprocess.CalledProcessError:
                    print(f"  Conda install failed for {conda_name}, trying pip...")

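            # Fallback to pip, using the distribution name when one is given
            # (e.g. 'google-cloud-storage' for the module 'google.cloud.storage')
            try:
                subprocess.check_call([sys.executable, '-m', 'pip', 'install', conda_name or package],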
                                    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
                return True
            except subprocess.CalledProcessError as e:
                print(f"  Pip install failed for {package}: {e}")
                return False

    def setup_packages(self):
        """Install required packages efficiently"""
        packages = {
            'pandas': 'pandas',
            'requests': 'requests',
            'rpy2': 'rpy2',
            'google.cloud.storage': 'google-cloud-storage'
        }

        print("Installing/checking packages...")
        failed = []

        for pkg, install_name in packages.items():
            if not self.install_package(pkg, install_name):
                failed.append(pkg)

        # Store failed packages globally for recovery
        globals()['failed_packages'] = failed

        if failed:
            print(f"⚠️  Failed to install: {', '.join(failed)}")
            print("Some features may not work")

            # Provide specific guidance for rpy2
            if 'rpy2' in failed:
                print("\n💡 For rpy2 installation issues:")
                if self.is_vertex:
                    print("   - Vertex AI: R may not be installed by default")
                    print("   - Run the next cell for automated R setup")
                else:
                    print("   - On Windows: May need Visual Studio Build Tools")
                    print("   - Try: conda install -c conda-forge rpy2")
                    print("   - Or: pip install rpy2 (requires R to be installed)")
        else:
            print("✓ All packages ready")

        # Mount Google Drive if needed (check if DATA_SOURCE exists)
        try:
            if globals().get('DATA_SOURCE') == "drive" and self.is_colab:
                self.mount_drive()
        except NameError:
            pass  # DATA_SOURCE not defined yet

    def mount_drive(self):
        """Mount Google Drive in Colab"""
        try:
            from google.colab import drive
            drive.mount('/content/drive')
            print("✓ Google Drive mounted")
        except Exception as e:
            print(f"❌ Failed to mount Google Drive: {e}")

# Initialize environment
env_manager = EnvironmentManager()

If running in a Vertex Workbench instance for the first time (or if the previous cell gave an error), run this cell to install R, then re-run the previous cell.

Otherwise, skip this cell.
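The installation cell decides whether to upgrade by matching substrings in the first line of `R --version`. An alternative is to parse the version into a tuple and compare it numerically. The sketch below assumes the usual `R version X.Y.Z` prefix; `parse_r_version` and `needs_upgrade` are hypothetical helpers, and the `(4, 1, 0)` threshold mirrors the cell's rule that R 4.0 and older are upgraded.

```python
import re

def parse_r_version(first_line):
    """Extract (major, minor, patch) from the first line of `R --version`."""
    match = re.search(r"R version (\d+)\.(\d+)\.(\d+)", first_line)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())

def needs_upgrade(first_line, minimum=(4, 1, 0)):
    """True when the version is unparseable or older than `minimum`."""
    version = parse_r_version(first_line)
    return version is None or version < minimum

print(needs_upgrade("R version 4.0.5 (2021-03-31)"))  # True
print(needs_upgrade("R version 4.3.1 (2023-06-16)"))  # False
```

Strings that do not match the pattern, such as a development build banner, are treated as needing an upgrade in this sketch.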

In [None]:
# Streamlined R Installation for Vertex AI (and other Linux environments)
def install_latest_r():
    """Install or verify latest R installation"""
    print("🔧 R Installation Check...")

    # Check current R version
    r_version = None
    try:
        result = subprocess.check_output(['R', '--version'], text=True)
        r_version = result.split('\n')[0]
        print(f"✓ R found: {r_version}")

        # Check if it's reasonably recent (R 4.0+)
        if "R version 4." in r_version and not r_version.startswith("R version 4.0"):
            print("✓ R version is current - no update needed")
            return True
        elif "R version 4.0" in r_version:
            print("⚠️ R 4.0 detected - will upgrade to latest for better package compatibility")
        else:
            print("⚠️ Older R version detected - will upgrade")

    except (subprocess.CalledProcessError, FileNotFoundError):
        print("❌ R not found - installing latest version")

    # Install/upgrade R
    print("📦 Installing latest R from CRAN...")

    commands = [
        # Update system
        "sudo apt-get update -qq",

        # Install prerequisites
        "sudo apt-get install -y software-properties-common dirmngr",

        # Add CRAN key and repository
        "sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9",

        # Add repository (auto-detect Ubuntu version)
        f"sudo add-apt-repository -y 'deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/'",

        # Update and install R
        "sudo apt-get update -qq",
        "sudo apt-get install -y r-base r-base-dev",

        # Install essential build tools
        "sudo apt-get install -y build-essential libcurl4-openssl-dev libssl-dev libxml2-dev"
    ]

    for cmd in commands:
        try:
            subprocess.run(cmd, shell=True, check=True,
                         stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        except subprocess.CalledProcessError as e:
            print(f"⚠️ Command failed: {cmd}")
            if "add-apt-repository" in cmd:
                # Fallback to Ubuntu repo if CRAN fails
                print("📦 Falling back to Ubuntu repository R...")
                subprocess.run("sudo apt-get install -y r-base r-base-dev",
                             shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
                break
            continue

    # Verify installation
    try:
        result = subprocess.check_output(['R', '--version'], text=True)
        new_version = result.split('\n')[0]
        print(f"✅ R installation complete: {new_version}")
        return True
    except (subprocess.CalledProcessError, FileNotFoundError):
        print("❌ R installation verification failed")
        return False

def install_rpy2():
    """Install rpy2 with multiple fallback methods"""
    print("🐍 Installing rpy2...")

    methods = [
        ([sys.executable, '-m', 'pip', 'install', '--upgrade', 'pip', 'setuptools'], "upgrade pip"),
        ([sys.executable, '-m', 'pip', 'install', 'rpy2'], "pip install rpy2"),
        (['conda', 'install', '-c', 'conda-forge', 'rpy2', '-y'], "conda install rpy2")
    ]

    for cmd, desc in methods:
        try:
            subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            # Test import
            import rpy2
            print(f"✅ rpy2 installed successfully via {desc}")
            return True
        except (subprocess.CalledProcessError, ImportError, FileNotFoundError):
            continue

    print("❌ rpy2 installation failed")
    return False

# Main execution
if env_manager.is_vertex or 'rpy2' in globals().get('failed_packages', []):
    print("=" * 60)

    # Install R if needed
    r_success = install_latest_r()

    # Install rpy2 if needed
    if r_success and 'rpy2' in globals().get('failed_packages', []):
        rpy2_success = install_rpy2()
        if rpy2_success:
            print("🎉 Setup complete! Please restart Python kernel and re-run environment setup.")

    print("=" * 60)
else:
    print("✓ Skipping R installation (not needed for this environment)")

---
## 3. R Environment Setup

Load R integration and install R packages efficiently.

In [None]:
# Load rpy2 extension for R integration
try:
    %load_ext rpy2.ipython
    print("✓ R integration loaded")
    globals()['R_AVAILABLE'] = True
except Exception as e:
    print(f"❌ Failed to load R integration: {e}")

    # Windows-specific troubleshooting
    if "R.dll" in str(e) or "error 0x7e" in str(e):
        print("\n💡 Windows R.dll loading issue detected:")
        print("   This is a common Windows + rpy2 compatibility issue")
        print("   Solutions:")
        print("   1. Restart Python kernel and try again")
        print("   2. Check R version compatibility with rpy2")
        print("   3. Try reinstalling R and rpy2")
        print("   4. Use Python-only analysis (fallback available)")
        globals()['R_AVAILABLE'] = False
    else:
        print("Install rpy2: pip install rpy2")
        globals()['R_AVAILABLE'] = False

Install the essential R packages and attempt a few optional ones.

Note: Other packages needed for specific analyses (e.g., advanced modeling packages) are installed and loaded as needed later in the notebook.

In [None]:
%%R -i VERBOSE_PRINTS

# Environment-aware R package setup
# Standard approach, then Vertex AI fallback if needed

# Detect environment
is_vertex <- Sys.getenv("DL_ANACONDA_HOME") != ""
is_colab <- Sys.getenv("COLAB_GPU") != ""

if(is_colab) {
  cat(" Google Colab detected...\n")
} else if(is_vertex) {
  cat(" Vertex AI detected...\n")
} else {
  cat(" Local environment detected...\n")
}

# Standard clean setup (for all environments initially)
essential_packages <- c(
  "data.table",    # Fast data manipulation
  "ggplot2",       # Plotting
  "scales"         # For ggplot2 percentage scales
)

optional_packages <- c(
  "survey",        # Survey statistics
  "broom"          # Model tidying
)

# Fast installation settings
repos <- "https://cloud.r-project.org"
options(repos = repos)
Sys.setenv(MAKEFLAGS = paste0("-j", parallel::detectCores()))

# Package check and load functions
pkg_available <- function(pkg) {
  tryCatch({
    find.package(pkg, quiet = TRUE)
    TRUE
  }, error = function(e) FALSE)
}

load_pkg <- function(pkg) {
  tryCatch({
    suppressMessages(library(pkg, character.only = TRUE, quietly = TRUE))
    TRUE
  }, error = function(e) FALSE)
}

# Install missing essential packages
missing_essential <- essential_packages[!sapply(essential_packages, pkg_available)]

if(length(missing_essential) > 0) {
  cat("Installing essential packages:", paste(missing_essential, collapse = ", "), "\n")

  tryCatch({
    install.packages(missing_essential,
                    repos = repos,
                    type = getOption("pkgType"),
                    dependencies = FALSE,
                    quiet = !VERBOSE_PRINTS,
                    Ncpus = parallel::detectCores())
  }, error = function(e) {
    cat("Binary install failed, trying source...\n")
    install.packages(missing_essential,
                    repos = repos,
                    type = "source",
                    dependencies = FALSE,
                    quiet = !VERBOSE_PRINTS)
  })
}

# Load essential packages
essential_loaded <- sapply(essential_packages, load_pkg)
essential_success <- sum(essential_loaded)

cat("✓ Essential packages loaded:", essential_success, "/", length(essential_packages), "\n")

# Quick install optional packages (30s timeout)
missing_optional <- optional_packages[!sapply(optional_packages, pkg_available)]

if(length(missing_optional) > 0) {
  cat("Installing optional packages...\n")

  for(pkg in missing_optional) {
    tryCatch({
      setTimeLimit(cpu = 30, elapsed = 30, transient = TRUE)
      install.packages(pkg, repos = repos,
                      type = getOption("pkgType"),
                      dependencies = FALSE,
                      quiet = TRUE)
      cat("✓", pkg, "installed\n")
    }, error = function(e) {
      cat("⚠️", pkg, "skipped (install failed or timed out)\n")
    })

    setTimeLimit(cpu = Inf, elapsed = Inf, transient = FALSE)
  }
}

# Load optional packages
optional_loaded <- sapply(optional_packages, load_pkg)
optional_success <- sum(optional_loaded)

cat("✓ Optional packages loaded:", optional_success, "/", length(optional_packages), "\n")

# Check if we need Vertex AI aggressive installation
has_datatable <- require("data.table", quietly = TRUE)
has_ggplot <- require("ggplot2", quietly = TRUE)

if(is_vertex && (!has_datatable || !has_ggplot)) {
  cat("\n🔧 Standard installation incomplete on Vertex AI - using aggressive method...\n")

  # Check R version for compatibility
  r_version <- R.version.string
  r_numeric <- as.numeric(R.version$major) + as.numeric(R.version$minor)/10
  cat("R version:", r_version, "(numeric:", r_numeric, ")\n")

  # System dependencies for Vertex AI
  system_deps <- c(
    "apt-get update -qq",
    "apt-get install -y libfontconfig1-dev libcairo2-dev",
    "apt-get install -y libxml2-dev libcurl4-openssl-dev libssl-dev",
    "apt-get install -y libharfbuzz-dev libfribidi-dev",
    "apt-get install -y libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev"
  )

  cat("Installing system dependencies...\n")
  for(cmd in system_deps) {
    system(paste("sudo", cmd), ignore.stdout = TRUE, ignore.stderr = TRUE)
  }

  # Aggressive package installation for failed packages
  failed_packages <- c()
  if(!has_datatable) failed_packages <- c(failed_packages, "data.table")
  if(!has_ggplot) failed_packages <- c(failed_packages, "ggplot2", "scales")

  repos_vertex <- c("https://cran.rstudio.com/", "https://cloud.r-project.org")

  for(pkg in failed_packages) {
    cat("Aggressively installing", pkg, "...")
    installed <- FALSE

    # For ggplot2/scales, try version-specific installation if R < 4.1
    if((pkg == "ggplot2" || pkg == "scales") && r_numeric < 4.1) {
      cat("(R < 4.1 detected - trying compatible versions)...")

      # First try to install remotes if not available
      if(!require("remotes", quietly = TRUE)) {
        tryCatch({
          install.packages("remotes", repos = repos_vertex[1], quiet = TRUE)
        }, error = function(e) NULL)
      }

      # Try installing older compatible versions
      if(pkg == "ggplot2" && require("remotes", quietly = TRUE)) {
        # Try ggplot2 versions compatible with older R
        old_versions <- c("3.4.4", "3.4.3", "3.4.2", "3.4.0", "3.3.6")
        for(ver in old_versions) {
          tryCatch({
            remotes::install_version("ggplot2", version = ver, repos = repos_vertex[1], quiet = TRUE)
            if(require("ggplot2", quietly = TRUE)) {
              cat(" ✓ (v", ver, ")\n")
              installed <- TRUE
              break
            }
          }, error = function(e) NULL)
        }
      } else if(pkg == "scales" && require("remotes", quietly = TRUE)) {
        # Try scales versions compatible with older R
        old_versions <- c("1.3.0", "1.2.1", "1.2.0", "1.1.1")
        for(ver in old_versions) {
          tryCatch({
            remotes::install_version("scales", version = ver, repos = repos_vertex[1], quiet = TRUE)
            if(require("scales", quietly = TRUE)) {
              cat(" ✓ (v", ver, ")\n")
              installed <- TRUE
              break
            }
          }, error = function(e) NULL)
        }
      }

      # Fallback: try archived CRAN packages if remotes failed
      if(!installed && pkg == "ggplot2") {
        cat("(trying archived versions)...")
        archived_urls <- c(
          "https://cran.r-project.org/src/contrib/Archive/ggplot2/ggplot2_3.4.4.tar.gz",
          "https://cran.r-project.org/src/contrib/Archive/ggplot2/ggplot2_3.4.0.tar.gz",
          "https://cran.r-project.org/src/contrib/Archive/ggplot2/ggplot2_3.3.6.tar.gz"
        )
        for(url in archived_urls) {
          tryCatch({
            install.packages(url, repos = NULL, type = "source", quiet = TRUE)
            if(require("ggplot2", quietly = TRUE)) {
              cat(" ✓ (archived)\n")
              installed <- TRUE
              break
            }
          }, error = function(e) NULL)
        }
      }
    }

    # Standard installation if version-specific didn't work
    if(!installed) {
      for(repo in repos_vertex) {
        tryCatch({
          install.packages(pkg, repos = repo, dependencies = TRUE, quiet = TRUE)
          if(require(pkg, character.only = TRUE, quietly = TRUE)) {
            cat(" ✓\n")
            installed <- TRUE
            break
          }
        }, error = function(e) NULL)
      }
    }

    if(!installed) cat(" FAILED\n")
  }

  # Re-check after aggressive installation
  has_datatable <- require("data.table", quietly = TRUE)
  has_ggplot <- require("ggplot2", quietly = TRUE)

  cat("📦 After aggressive installation: data.table =", has_datatable, "| ggplot2 =", has_ggplot, "\n")
}

# Final status check (universal)
if(has_datatable && has_ggplot) {
  cat("🎉 Core environment ready! (data.table + ggplot2)\n")
  setDTthreads(0)  # Use all cores

} else if(has_datatable) {
  cat("⚠️ Partial setup: data.table ready, plotting may be limited\n")
  setDTthreads(0)

} else {
  cat("❌ Critical failure: data.table not available\n")
  stop("Cannot proceed without data.table")
}

Check if R setup is complete.

In [None]:
%%R

# Quick verification and setup
cat("Verifying R environment...\n")

# Test core functionality
tryCatch({
  # Test data.table (essential)
  dt_test <- data.table(x = 1:3, y = letters[1:3])
  cat("✓ data.table ready\n")

  # Test ggplot2 (optional)
  if(require("ggplot2", quietly = TRUE)) {
    cat("✓ ggplot2 ready\n")
  } else {
    cat("⚠️ ggplot2 not available (plots disabled)\n")
  }

  # Set up data.table options for performance
  setDTthreads(0)  # Use all cores

  cat("✓ R environment optimized and ready!\n")

}, error = function(e) {
  cat("❌ R environment verification failed:", e$message, "\n")
  stop("R setup incomplete")
})

# Clean up test objects
rm(list = ls()[!ls() %in% c("VERBOSE_PRINTS")])
invisible(gc())

---
## 4. Data Loading

Configuration-based data loader with error handling.
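The GitHub loader in the next cell holds the whole response in memory before writing it to disk, which is fine for the simulated file. For larger downloads, streaming in chunks keeps memory use flat. The sketch below swaps `requests` for the standard library's `urllib` so that it runs without extra packages; `download_to_file` is a hypothetical helper, not part of the notebook.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path

def download_to_file(url, dest, chunk_size=1024 * 1024):
    """Stream `url` to `dest` in fixed-size chunks; return the bytes written."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=60) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, length=chunk_size)
    return dest.stat().st_size

# Demonstration with a local file:// URL, so no network access is needed
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "source.csv"
    src.write_text("KEY_NASS,AGE\n1,42\n")
    n_bytes = download_to_file(src.as_uri(), Path(tmp) / "data" / "copy.csv")
    print(f"wrote {n_bytes} bytes")
```

With `requests`, the equivalent approach is `requests.get(url, stream=True)` combined with `response.iter_content(chunk_size=...)`.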

In [None]:
import pandas as pd
import requests
from pathlib import Path

class DataLoader:
    def __init__(self):
        self.base_path = Path('/content') if env_manager.is_colab else Path.home()
        self.data_dir = self.base_path / 'data'
        self.data_dir.mkdir(exist_ok=True)

    def load_data(self):
        """Load data based on configuration"""
        loaders = {
            'github': self._load_github,
            'local': self._load_local,
            'gcs': self._load_gcs,
            'drive': self._load_drive
        }

        if DATA_SOURCE not in loaders:
            raise ValueError(f"Invalid DATA_SOURCE: {DATA_SOURCE}")

        print(f"Loading data from: {DATA_SOURCE.upper()}")
        return loaders[DATA_SOURCE]()

    def _load_github(self):
        """Load from GitHub releases"""
        response = requests.get(GITHUB_URL, timeout=60)
        response.raise_for_status()

        data_path = self.data_dir / "nass_data.csv"
        data_path.write_bytes(response.content)

        print(f"✓ Downloaded from GitHub ({response.headers.get('content-length', 'unknown')} bytes)")
        return pd.read_csv(data_path)

    def _load_local(self):
        """Load from local file"""
        search_paths = [
            self.base_path / LOCAL_FILENAME,
            self.data_dir / LOCAL_FILENAME,
            Path.cwd() / LOCAL_FILENAME
        ]

        for path in search_paths:
            if path.exists():
                print(f"✓ Found local file: {path}")
                return pd.read_csv(path)

        raise FileNotFoundError(f"File not found in: {[str(p) for p in search_paths]}")

    def _load_gcs(self):
        """Load from Google Cloud Storage"""
        from google.cloud import storage

        # Smart authentication
        if Path(GCS_SERVICE_ACCOUNT_KEY).exists():
            client = storage.Client.from_service_account_json(GCS_SERVICE_ACCOUNT_KEY)
        else:
            client = storage.Client()  # Use default credentials

        bucket = client.bucket(GCS_BUCKET)
        blob = bucket.blob(GCS_BLOB)

        data_path = self.data_dir / "nass_data.csv"
        blob.download_to_filename(data_path)

        print(f"✓ Downloaded from GCS: {GCS_BUCKET}/{GCS_BLOB}")
        return pd.read_csv(data_path)

    def _load_drive(self):
        """Load from Google Drive (Colab only)"""
        if not env_manager.is_colab:
            raise RuntimeError("Drive loading only available in Google Colab")

        drive_path = Path(DRIVE_PATH)
        if not drive_path.exists():
            raise FileNotFoundError(f"Drive file not found: {DRIVE_PATH}")

        print(f"✓ Loading from Google Drive: {DRIVE_PATH}")
        return pd.read_csv(drive_path)

# Load data
try:
    loader = DataLoader()
    df = loader.load_data()

    print(f"✅ Data loaded successfully!")
    print(f"   Shape: {df.shape}")
    print(f"   Memory: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")

    if VERBOSE_PRINTS:
        print(f"\nColumns: {list(df.columns)}")
        print(f"\nFirst 3 rows:")
        print(df.head(3))

except Exception as e:
    print(f"❌ Data loading failed: {e}")
    print(f"💡 Try changing DATA_SOURCE or check file paths")
    raise

Check if data has been loaded.

In [None]:
# Verify data is available before R processing
try:
    if 'df' not in globals():
        print("❌ Data not loaded!")
        print("💡 Please run the 'Data Loading' section first (cell 12)")
        print("   This will create the 'df' variable needed for R analysis")
        raise NameError("df variable not found - run data loading first")

    print(f"✅ Data verified: {df.shape[0]:,} rows x {df.shape[1]} columns")
    print("✅ Ready for R analysis")

except NameError as e:
    print(f"❌ {e}")
    print("\n🔄 Quick fix: Run these cells in order:")
    print("   1. Configuration (cell 5)")
    print("   2. Environment Setup (cell 7)")
    print("   3. Data Loading (cell 12)")
    print("   4. Then continue with R analysis")
    raise

## 5. Complete Data Preprocessing

Streamlined preprocessing: remove unneeded variables, clean data types, and create new variables. This is done in Python before passing the data to R, for efficiency.
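The variable-removal step uses anchored regular expressions, so the first-listed procedure columns (`CPT1`, `CPTCCS1`) survive while the secondary ones are dropped. The sketch below reproduces the matching with the standard `re` module on a handful of invented column names; `columns_to_drop` is a hypothetical helper.

```python
import re

DROP_PATTERNS = [
    r"^CPTCCS[2-9]$",       # CPTCCS2-CPTCCS9
    r"^CPTCCS[1-3][0-9]$",  # two-digit suffixes 10-39
    r"^CPT[2-9]$",          # CPT2-CPT9
    r"^CPT[1-3][0-9]$",     # two-digit suffixes 10-39
    r"^DXCCSR_",            # every DXCCSR_* column
]

def columns_to_drop(columns, patterns=DROP_PATTERNS):
    """Return the columns matching any pattern, preserving input order."""
    compiled = [re.compile(p) for p in patterns]
    return [c for c in columns if any(rx.match(c) for rx in compiled)]

# Invented column names for illustration
sample = ["AGE", "CPT1", "CPT2", "CPT15", "CPTCCS1", "CPTCCS30",
          "DXCCSR_CIR007", "TOTCHG"]
print(columns_to_drop(sample))
# ['CPT2', 'CPT15', 'CPTCCS30', 'DXCCSR_CIR007']
```

The `[1-3][0-9]` class would also match suffixes 31 through 39 if such columns existed. The trailing `$` matters because both `re.match` and pandas `str.match` anchor only at the start of the string.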

In [None]:
# Data Preprocessing in Python (before R transfer)
print("Complete preprocessing: removing variables + cleaning data types...")
print(f"Original shape: {df.shape}")

# ===== 1. REMOVE UNNECESSARY VARIABLES =====
print("\n1️⃣ Removing unnecessary variables...")

# Smart pattern-based removal in pandas (much faster than R)
drop_patterns = [
    r'^CPTCCS[2-9]$',      # CPTCCS2-CPTCCS9
    r'^CPTCCS[1-3][0-9]$', # CPTCCS10-30
    r'^CPT[2-9]$',         # CPT2-CPT9
    r'^CPT[1-3][0-9]$',    # CPT10-30
    r'^DXCCSR_',           # All DXCCSR columns (500+)
]

# Find columns to drop using vectorized operations
drop_cols = []
for pattern in drop_patterns:
    matches = df.columns[df.columns.str.match(pattern)].tolist()
    drop_cols.extend(matches)

# Remove duplicates
drop_cols = list(set(drop_cols))

print(f"   📊 Found {len(drop_cols)} columns to drop")
print(f"   🗑️  Patterns: CPTCCS2-30, CPT2-30, all DXCCSR_*")

# Drop the columns
df = df.drop(columns=drop_cols)
print(f"   ✅ Reduced from {df.shape[1] + len(drop_cols)} to {df.shape[1]} columns")

# ===== 2. CLEAN DATA TYPES FOR rpy2 =====
print("\n2️⃣ Cleaning data types for rpy2 compatibility...")

# Convert all object columns to strings (prevents mixed-type issues).
# Missing values are filled first so they do not become the literal string 'nan'.
object_columns = df.select_dtypes(include=['object']).columns
if len(object_columns) > 0:
    for col in object_columns:
        df[col] = df[col].fillna('').astype(str)
    print(f"   ✅ Converted {len(object_columns)} object columns to strings")

# Handle NaN/inf values consistently while keeping numeric columns numeric.
# inf becomes NaN, which R treats as missing; filling numeric columns with ''
# would turn them into mixed-type object columns.
float_cols = df.select_dtypes(include=['float64']).columns
if len(float_cols) > 0:
    df[float_cols] = df[float_cols].replace([float('inf'), float('-inf')], float('nan'))
    print(f"   ✅ Cleaned inf values in {len(float_cols)} float columns")

# ===== 3. CREATE KEY ANALYTICAL VARIABLES IN PANDAS =====
print("\n3️⃣ Creating analytical variables...")

# Create WHITE indicator (1=White, 0=Non-White or missing)
if 'RACE' in df.columns:
    # Compare numerically so that 1, 1.0, and '1' are all recognized
    df['WHITE'] = (pd.to_numeric(df['RACE'], errors='coerce') == 1).astype(int)
    print("   ✅ Created WHITE indicator (0/1)")

# Create age groups
if 'AGE' in df.columns:
    df['AGE'] = pd.to_numeric(df['AGE'], errors='coerce')  # Ensure numeric
    df['AGE_GROUP'] = pd.cut(df['AGE'],
                            bins=[0, 18, 30, 45, 65, float('inf')],
                            labels=['0-17', '18-29', '30-44', '45-64', '65+'],
                            right=False)
    df['AGE_GROUP'] = df['AGE_GROUP'].astype(str)  # Convert to string for R
    print("   ✅ Created AGE_GROUP categories")

# Create income level labels
if 'ZIPINC_QRTL' in df.columns:
    income_map = {1: 'Q1-Lowest', 2: 'Q2', 3: 'Q3', 4: 'Q4-Highest'}
    # Map numerically so float-coded values such as 1.0 are not labeled 'Unknown'
    quartile = pd.to_numeric(df['ZIPINC_QRTL'], errors='coerce')
    df['INCOME_LEVEL'] = quartile.map(income_map).fillna('Unknown')
    print("   ✅ Created INCOME_LEVEL labels")

# Ensure key numeric variables are properly typed
numeric_vars = ['AGE', 'DISCWT', 'TOTCHG']
for var in numeric_vars:
    if var in df.columns:
        df[var] = pd.to_numeric(df[var], errors='coerce')

print(f"\n✅ PREPROCESSING COMPLETE!")
print(f"📊 Final shape: {df.shape}")
print(f"💾 Ready for R transfer!")

## 6. Final R Transfer & Processing

Transfer the clean data to R and apply any final R-specific formatting.

In [None]:
%%R -i df -i VERBOSE_PRINTS

# Convert to data.table and apply R types
NASS <- as.data.table(df)

# Factor variables
factor_vars <- c("ZIPINC_QRTL", "PAY1", "CPTCCS1", "HOSP_LOCATION",
                 "HOSP_TEACH", "HOSP_NASS", "RACE", "AGE_GROUP", "INCOME_LEVEL")
existing_factors <- factor_vars[factor_vars %in% names(NASS)]
NASS[, (existing_factors) := lapply(.SD, as.factor), .SDcols = existing_factors]

# Boolean variables
if("FEMALE" %in% names(NASS)) NASS[, FEMALE := as.logical(as.numeric(FEMALE))]
if("WHITE" %in% names(NASS)) NASS[, WHITE := as.logical(as.numeric(WHITE))]

# Compact output
cat("✅ R Complete:", nrow(NASS), "rows,", ncol(NASS), "cols,",
    round(object.size(NASS)/1024^2, 1), "MB\n")
cat("Converted", length(existing_factors), "factors + 2 booleans\n")

if(VERBOSE_PRINTS) {
  cat("\nColumns:\n")
  print(colnames(NASS))
}

cat("Ready for analysis!\n")

# Analysis

---
## 1. R Analysis - Income Quartile vs Procedure

Visualize income distribution within the most common procedures.
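As a sketch of the quantity behind such a plot, the share of each income quartile within the most frequent procedures can be computed as follows. The records are invented, the shares are unweighted (applying `DISCWT` would change them), and `income_share_by_procedure` is a hypothetical helper.

```python
from collections import Counter

def income_share_by_procedure(rows, top_n=10):
    """Return {procedure: {quartile: share}} for the top_n most frequent procedures.

    rows is an iterable of (procedure_code, income_quartile) pairs.
    """
    rows = list(rows)
    totals = Counter(proc for proc, _ in rows)
    top = [proc for proc, _ in totals.most_common(top_n)]
    cells = Counter((proc, q) for proc, q in rows if proc in top)
    return {
        proc: {q: n / totals[proc] for (p, q), n in sorted(cells.items()) if p == proc}
        for proc in top
    }

# Invented records for illustration: (procedure code, income quartile)
toy = [("A", 1), ("A", 1), ("A", 4), ("B", 2), ("B", 3), ("C", 1)]
print(income_share_by_procedure(toy, top_n=2))
```

Each inner dictionary sums to 1, so procedures of very different volume can be compared on the same scale.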

In [None]:
%%R

# ===== GENERATE SUMMARY STATISTICS =====
cat("\n=== DATASET ANALYSIS SUMMARY ===\n")
cat("📊 Data shape:", nrow(NASS), "rows x", ncol(NASS), "columns\n")
flush.console()

# Create top 10 procedures table
if("CPTCCS1" %in% names(NASS)) {
  top10 <- NASS[, .N, by = CPTCCS1][order(-N)][1:10]
  cat("\n🔝 Top 10 procedures (CPTCCS1):\n")
  print(top10)
} else {
  cat("⚠️ CPTCCS1 variable not found\n")
}

# Basic demographic summary
if("WHITE" %in% names(NASS)) {
  white_pct <- round(mean(NASS$WHITE, na.rm = TRUE) * 100, 1)
  cat("\n👥 Race: White =", white_pct, "%, Non-White =", 100 - white_pct, "%\n")
}

if("AGE" %in% names(NASS)) {
  age_summary <- summary(NASS$AGE)
  cat("\n📅 Age distribution:\n")
  print(age_summary)
}

if("INCOME_LEVEL" %in% names(NASS)) {
  income_dist <- NASS[, .N, by = INCOME_LEVEL][order(INCOME_LEVEL)]
  cat("\n💰 Income quartiles:\n")
  print(income_dist)
}

cat("\n✅ Summary complete - ready for analysis!\n")
flush.console()

---
## 10. Census API Setup

Set up environment variable for Census API key.
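
A slightly more defensive variant is sketched below, assuming the key may already be exported in the environment (for example on a preconfigured VM): it reuses the key when present and prompts only otherwise. The helper name `ensure_census_key` and its `prompt` and `env` parameters are illustrative and not part of the notebook.

```python
import getpass
import os


def ensure_census_key(prompt=getpass.getpass, env=os.environ):
    """Return the Census API key, prompting only if it is not already set.

    `ensure_census_key`, `prompt`, and `env` are hypothetical names used for
    this sketch; the notebook cell sets the variable directly.
    """
    key = env.get("CENSUS_API_KEY", "").strip()
    if not key:
        key = prompt("Enter your Census API key (will not echo):").strip()
        if not key:
            raise ValueError("A Census API key is required for the tidycensus calls.")
        env["CENSUS_API_KEY"] = key
    return key
```

Calling `ensure_census_key()` in place of the direct assignment leaves an existing key untouched, which helps when the notebook is re-run.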

In [None]:
import getpass, os, json, textwrap
os.environ["CENSUS_API_KEY"] = getpass.getpass("Enter your Census API key (will not echo):")

---
## 11. R | Set Census Key & Pull 2020 DHC Totals

In [None]:
%%R -i VERBOSE_PRINTS
# Register the key for this R session only (install = FALSE leaves .Renviron untouched)
tidycensus::census_api_key(Sys.getenv("CENSUS_API_KEY"), overwrite = FALSE, install = FALSE)

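# In table P12, cells 001 (total), 002 (male) and 026 (female) are subtotals.
# Keep only the 46 sex-by-age cells so that summing them does not triple-count
# the population.
get_vars <- function(base) sprintf("%s_%03dN", base, setdiff(1:49, c(1, 2, 26)))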

vars_total <- get_vars("P12")
vars_white <- get_vars("P12I")

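pull_state_totals <- function(vars){
  # Namespace-qualified so the cell does not depend on tidycensus/dplyr being attached
  tidycensus::get_decennial(geography = "state",
                            variables = vars,
                            year = 2020, sumfile = "dhc") |>
    dplyr::group_by(NAME) |>
    dplyr::summarise(total = sum(value))
}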

total_pop  <- pull_state_totals(vars_total)
white_pop  <- pull_state_totals(vars_white)

# merge() on tibbles returns a data.frame; convert so that := works below
census_prop <- as.data.table(merge(total_pop, white_pop, by = "NAME",
                                   suffixes = c("_all","_white")))
census_prop[, prop_white := total_white / total_all]

if (VERBOSE_PRINTS) head(census_prop)

---
## 12. R | Weighted vs Unweighted Proportion Test
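
The distinction being tested can be illustrated outside R. In the unweighted estimate every sampled record counts once; in the weighted estimate each record counts in proportion to its discharge weight (`DISCWT`). The Python sketch below uses made-up toy values, not NASS data, and the function names are illustrative.

```python
def unweighted_proportion(flags):
    """Share of records where the flag is true; every record counts once."""
    return sum(1 for f in flags if f) / len(flags)


def weighted_proportion(flags, weights):
    """Share of the total weight carried by records where the flag is true."""
    total = sum(weights)
    return sum(w for f, w in zip(flags, weights) if f) / total


# Toy values for illustration only (not NASS data).
white = [True, True, False, False]
discwt = [1.0, 1.0, 3.0, 3.0]

print(unweighted_proportion(white))        # 2 of 4 records
print(weighted_proportion(white, discwt))  # 2 of 8 weighted discharges
```

When the weights differ systematically between groups, as in this toy case, the two estimates diverge; when the weights are constant they coincide.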

In [None]:
%%R
library(survey)

# Survey design using provided discharge weight
des <- svydesign(ids = ~1, weights = ~DISCWT, data = NASS)

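unweighted_hat <- mean(NASS$WHITE, na.rm = TRUE)
# WHITE is logical, so svymean(~WHITE) returns both WHITEFALSE and WHITETRUE
# (FALSE first); convert to 0/1 to get the proportion white directly.
weighted_hat   <- coef(svymean(~I(as.numeric(WHITE)), des, na.rm = TRUE))[[1]]

# Note: census_prop holds every state returned by get_decennial();
# it is not filtered to the states that contribute to NASS.
us_prop <- weighted.mean(census_prop$prop_white,
                         w = census_prop$total_all)

cat(sprintf("Unweighted NASS proportion white: %.3f\n", unweighted_hat))
cat(sprintf("Weighted   NASS proportion white: %.3f\n", weighted_hat))
cat(sprintf("2020 Census (all states pulled) proportion white: %.3f\n", us_prop))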

svytest <- svyciprop(~WHITE, des,
                     method = "likelihood", level = 0.95)
print(svytest)

---
## 13. R | Age-by-sex plot vs Census (adapted from `agesociodiv.r`)
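
The binning step in this cell uses 23 right-closed age groups. As a cross-check on the break points, the same intervals can be sketched in Python with `bisect`; the function name `age_group` and the constant names are illustrative.

```python
from bisect import bisect_left

# Inclusive upper bound of every group except the open-ended last one,
# matching the finite values in `age_breaks`.
UPPER_BOUNDS = [4, 9, 14, 17, 19, 20, 21, 24, 29, 34, 39, 44, 49, 54, 59,
                61, 64, 66, 69, 74, 79, 84]
LABELS = ["U5", "5-9", "10-14", "15-17", "18-19", "20", "21",
          "22-24", "25-29", "30-34", "35-39", "40-44", "45-49",
          "50-54", "55-59", "60-61", "62-64", "65-66", "67-69",
          "70-74", "75-79", "80-84", "85+"]


def age_group(age):
    """Map an age in years to its label, like cut(..., right = TRUE)."""
    # bisect_left returns the index of the first bound >= age, so an age equal
    # to a bound falls in the group that the bound closes.
    return LABELS[bisect_left(UPPER_BOUNDS, age)]
```

Because the intervals are right-closed, ages 20 and 21 each land in their own single-year group, and any age above 84 falls into `85+`.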

In [None]:
%%R
age_breaks <- c(-Inf,4,9,14,17,19,20,21,24,29,34,39,44,49,54,59,61,64,
                66,69,74,79,84,Inf)
age_labels <- c("U5","5-9","10-14","15-17","18-19","20","21",
                "22-24","25-29","30-34","35-39","40-44","45-49",
                "50-54","55-59","60-61","62-64","65-66","67-69",
                "70-74","75-79","80-84","85+")

NASS[, AGE_GROUP := cut(AGE, breaks = age_breaks,
                        labels = age_labels, right = TRUE)]

plot_df <- NASS[, .(white = sum(WHITE),
                    n     = .N),
                by = .(SEX = factor(FEMALE, labels=c("Male","Female")),
                       AGE_GROUP)]
plot_df[, prop := white/n]

# Create plot with intelligent fallback
if(require("ggplot2", quietly = TRUE) && require("scales", quietly = TRUE)) {
  # Primary: ggplot2 with scales
  gg_gender <- ggplot(plot_df, aes(x = AGE_GROUP, y = prop,
                                   group = SEX, color = SEX)) +
    geom_line(linewidth=1) +
    geom_point() +
    scale_y_continuous(labels = scales::percent) +
    labs(y = "% White (NASS, simulated)", x = "Age-group",
         title = "Crude white proportion by age & sex") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle=45, hjust=1))
  print(gg_gender)
  cat("📊 ggplot2 plot generated\n")

} else if(require("ggplot2", quietly = TRUE)) {
  # Fallback: ggplot2 without scales
  gg_gender <- ggplot(plot_df, aes(x = AGE_GROUP, y = prop,
                                   group = SEX, color = SEX)) +
    geom_line(linewidth=1) +
    geom_point() +
    labs(y = "Proportion White (NASS, simulated)", x = "Age-group",
         title = "Crude white proportion by age & sex") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle=45, hjust=1))
  print(gg_gender)
  cat("📊 ggplot2 plot generated (without percentage scaling)\n")

} else {
  # Base R fallback
  cat("📊 Using base R plotting (ggplot2 not available)\n")

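  # Sort by AGE_GROUP so the x positions follow age order (grouped output is
  # returned in order of first appearance, not factor order)
  male_data <- plot_df[SEX == "Male"][order(AGE_GROUP)]
  female_data <- plot_df[SEX == "Female"][order(AGE_GROUP)]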

  plot(1:nrow(male_data), male_data$prop, type = "b", col = "blue",
       xlab = "Age Group", ylab = "Proportion White",
       main = "Crude white proportion by age & sex",
       ylim = c(0, 1), xaxt = "n")
  lines(1:nrow(female_data), female_data$prop, type = "b", col = "red")
  axis(1, at = 1:nrow(male_data), labels = male_data$AGE_GROUP, las = 2)
  legend("topright", legend = c("Male", "Female"), col = c("blue", "red"), lty = 1)
}

---
## 14. R | Multilevel logistic models (hospital nested, 3 tiers)

In [None]:
%%R
features <- NASS[, .(WHITE,
                     FEMALE,
                     ZIPINC_QRTL,
                     PAY1,
                     CPTCCS1,
                     HOSP_LOCATION,
                     HOSP_TEACH,
                     HOSP_NASS)]

features[, c(names(features)) := lapply(.SD, as.factor)]

formulas <- list(
  m1 = WHITE ~ FEMALE + (1|HOSP_NASS),
  m2 = WHITE ~ FEMALE + ZIPINC_QRTL + (1|HOSP_NASS),
  m3 = WHITE ~ FEMALE + ZIPINC_QRTL + PAY1 + CPTCCS1 +
                    HOSP_LOCATION + HOSP_TEACH + (1|HOSP_NASS)
)

fit <- lapply(formulas, glmer, family = binomial, data = features,
              control = glmerControl(optimizer="bobyqa", optCtrl=list(maxfun=2e4)))

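# broom.mixed (not broom) supplies the tidy() methods for glmer fits;
# head() avoids NA padding when a model has fewer than five fixed effects
lapply(fit, function(m) head(broom.mixed::tidy(m, effects = "fixed"), 5))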

---
## 15. R | Compare AUC across the three models
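
AUC here is the probability that a randomly chosen record with `WHITE = TRUE` receives a higher predicted probability than a randomly chosen record with `WHITE = FALSE`, with ties counted as one half. A minimal Python sketch of that definition on toy values (not model output) follows. It is a brute-force pairwise count for illustration and not necessarily how `pROC` computes the statistic; the function name `pairwise_auc` is illustrative.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative.")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Toy labels and scores for illustration only.
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 3 of 4 pairs -> 0.75
```

An AUC of 0.5 corresponds to rankings no better than chance, so the comparison across the three models is about how far each one moves above that baseline.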

In [None]:
%%R
library(pROC)
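auc_vals <- sapply(fit, function(m){
  preds <- predict(m, type="response")
  # Score against the rows glmer actually used; rows with NAs are dropped at
  # fit time, so features$WHITE can be longer than preds
  as.numeric(roc(model.frame(m)$WHITE, preds, quiet = TRUE)$auc)
})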
knitr::kable(data.frame(model = names(auc_vals), AUC = auc_vals),
             caption = "AUC (in-sample, simulated data)")

# Poster Generation

The poster is generated here.
