[![Open on GitHub](https://img.shields.io/badge/GitHub-View%20Source-181717?style=for-the-badge&logo=github)](https://github.com/SeenaKhosravi/NASS/blob/main/Analysis_NASS.ipynb)
[![Open In Colab](https://img.shields.io/badge/Colab-Open%20Notebook-F9AB00?style=for-the-badge&logo=google-colab)](https://colab.research.google.com/github/SeenaKhosravi/NASS/blob/main/Analysis_NASS.ipynb)
[![Open in Vertex AI](https://img.shields.io/badge/Vertex%20AI-Open%20Workbench-4285F4?style=for-the-badge&logo=google-cloud)](https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/SeenaKhosravi/NASS/main/Analysis_NASS.ipynb)


# Socioeconomic and Demographic Drivers of Ambulatory Surgery Usage
### HCUP NASS 2020 – Reproducible Pipeline (Python + R)

**Author:** Seena Khosravi, MD  
**LLMs Utilized:** Claude Sonnet 4.1, Opus 4.1; ChatGPT 4o, o4; Deepseek 3.1; Gemini 2.5 Pro  
**Last Updated:** September 9, 2025  

**Data Source:**  
Department of Health & Human Services (HHS)  
Agency for Healthcare Research and Quality (AHRQ)  
Healthcare Cost and Utilization Project (HCUP)  
Nationwide Ambulatory Surgery Sample (NASS)  
Year: 2020

---

## Overview
This notebook provides a reproducible analysis pipeline for examining socioeconomic and demographic factors influencing ambulatory surgery usage patterns. The analysis combines Python for data processing and R for statistical modeling. It assumes a processed dataset was created from the HCUP files via [Raw_NASS_Processing.R](https://github.com/SeenaKhosravi/NASS/blob/a7764ce80be8a82fc449831821c27d957176c410/Raw%20NASS%20%20Processing.R).

### Data Usage Agreement
**DUA-Compliant Online Implementation:** This notebook uses a smaller, simulated (artificial) dataset with the same structure as the file created by [Raw_NASS_Processing.R](https://github.com/SeenaKhosravi/NASS/blob/a7764ce80be8a82fc449831821c27d957176c410/Raw%20NASS%20%20Processing.R). The method used to generate the simulated dataset is also in [Raw_NASS_Processing.R](https://github.com/SeenaKhosravi/NASS/blob/a7764ce80be8a82fc449831821c27d957176c410/Raw%20NASS%20%20Processing.R). If you have signed the DUA and purchased the data from HCUP, the notebook can run on the full dataset loaded from local or cloud storage.

[Please see the DUA Agreement here.](https://hcup-us.ahrq.gov/team/NationwideDUA.jsp)

### Key Features
- **Multi-Platform:** Runs on Jupyter in a local environment, on a Jupyter server, on a cloud VM instance, or on a platform-as-a-service offering.
- **Flexible Data Storage:** GitHub (simulated, static, open access), Google Drive, Google Cloud Storage, or local files (closed access).
- **Reproducible:** All dependencies and environment setup are included.
- **Scalable:** Handles both the simulated dataset (1 GB, 700k rows) and the full dataset (12 GB, 7.8M rows), with scalable cloud options.

---

## Design Notes

### Architecture
- **Python is the primary language; R runs through the rpy2 IPython extension**
- **Python cells** handle "plumbing" (file I/O, environment setup, rpy2 configuration, data previews)
- **R cells** (prefixed by `%%R`) perform statistical analysis: survey weights, Census lookups, multilevel models, plots, classifiers, etc.

### Data Sources
- **Default:** Simulated dataset (1GB) from GitHub releases
- **Local:** Switch to locally stored files via configuration
- **Drive:** Google Drive (only available in Colab)
- **Cloud:** Google Cloud Storage support for large datasets


### Environment Support
- Local (JupyterLab with a Python 3.12 kernel)
- Jupyter Server (may require some configuration, depending on your setup)
- Google Colab (Pro recommended, with the high-RAM option)
- Vertex AI Workbench (JupyterLab 3, Python 3 kernel) (used for full analysis)

---

## 1. Configuration

Configure all settings here before running the notebook: data source, debugging options, and file paths. The default is the simulated dataset.

In [1]:
# ==================== CONFIGURATION ====================
# Data Source Options
DATA_SOURCE = "github"      # Options: "github", "local", "gcs", "drive"
VERBOSE_PRINTS = True       # False → suppress debug output

# GitHub source (default - simulated data)
GITHUB_URL = "https://github.com/SeenaKhosravi/NASS/releases/download/v1.0.0/nass_2020_simulated.csv"

# Local file options
LOCAL_FILENAME = "nass_2020_local.csv"

# Google Cloud Storage options
GCS_BUCKET = "nass_2020"
GCS_BLOB = "nass_2020_all.csv"
GCS_SERVICE_ACCOUNT_KEY = "/path/to/service-account-key.json"  # Optional

# Google Drive options (for Colab)
DRIVE_PATH = "/content/drive/MyDrive/NASS/nass_2020_full.csv"
# ======================================================

print("✓ Configuration loaded")
print(f"  Data source: {DATA_SOURCE}")
print(f"  Verbose mode: {VERBOSE_PRINTS}")

✓ Configuration loaded
  Data source: github
  Verbose mode: True
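
Because every later cell branches on `DATA_SOURCE`, it can help to validate the setting before anything is downloaded. The following is a minimal sketch and not part of the original notebook: `validate_config` and `VALID_SOURCES` are hypothetical names, and the option list simply mirrors the comment in the configuration cell.

```python
# Hypothetical helper: the names below are illustrative, not from the notebook.
VALID_SOURCES = ("github", "local", "gcs", "drive")


def validate_config(data_source, github_url=""):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if data_source not in VALID_SOURCES:
        problems.append(
            f"DATA_SOURCE must be one of {VALID_SOURCES}, got {data_source!r}"
        )
    if data_source == "github" and not github_url.startswith("https://"):
        problems.append("GITHUB_URL should be an https:// link")
    return problems


issues = validate_config(
    "github",
    "https://github.com/SeenaKhosravi/NASS/releases/download/v1.0.0/nass_2020_simulated.csv",
)
print("Configuration OK" if not issues else issues)
```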


---
## 2. Environment Setup & Package Installation

Detect the runtime environment and define Python helpers that install missing packages, using conda where available (except in Colab) and falling back to pip.

In [None]:
import os
import sys
import subprocess
from pathlib import Path

class EnvironmentManager:
    def __init__(self):
        self.detect_environment()
        self.setup_packages()
    
    def detect_environment(self):
        """Detect runtime environment"""
        self.is_colab = 'COLAB_GPU' in os.environ or 'google.colab' in sys.modules
        self.is_vertex = 'DL_ANACONDA_HOME' in os.environ
        
        if self.is_colab:
            self.env_type = "Google Colab"
        elif self.is_vertex:
            self.env_type = "Vertex AI"
        else:
            self.env_type = "Local/Jupyter"
        
        print(f"Environment: {self.env_type}")
    
    def check_conda_available(self):
        """Check if conda is available"""
        try:
            subprocess.check_call(['conda', '--version'], 
                                stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            return True
        except (subprocess.CalledProcessError, FileNotFoundError):
            return False
    
    def install_package(self, package, conda_name=None):
        """Smart package installation with fallback"""
        try:
            __import__(package)
            return True
        except ImportError:
            print(f"Installing {package}...")
            
            # Try conda first if available and not in Colab
            if conda_name and not self.is_colab and self.check_conda_available():
                try:
                    subprocess.check_call(['conda', 'install', '-y', conda_name], 
                                        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
                    return True
                except subprocess.CalledProcessError:
                    print(f"  Conda install failed for {conda_name}, trying pip...")
            
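            # Fallback to pip, using the distribution name when one is given
            # (e.g. 'google-cloud-storage' for the module 'google.cloud.storage')
            pip_name = conda_name or package
            try:
                subprocess.check_call([sys.executable, '-m', 'pip', 'install', pip_name], 
                                    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
                return True
            except subprocess.CalledProcessError as e:
                print(f"  Pip install failed for {pip_name}: {e}")
                return False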
    
    def setup_packages(self):
        """Install required packages efficiently"""
        packages = {
            'pandas': 'pandas',
            'requests': 'requests', 
            'rpy2': 'rpy2',
            'google.cloud.storage': 'google-cloud-storage'
        }
        
        print("Installing/checking packages...")
        failed = []
        
        for pkg, install_name in packages.items():
            if not self.install_package(pkg, install_name):
                failed.append(pkg)
        
        # Store failed packages globally for recovery
        globals()['failed_packages'] = failed
        
        if failed:
            print(f"⚠️  Failed to install: {', '.join(failed)}")
            print("Some features may not work")
            
            # Provide specific guidance for rpy2
            if 'rpy2' in failed:
                print("\n💡 For rpy2 installation issues:")
                if self.is_vertex:
                    print("   - Vertex AI: R may not be installed by default")
                    print("   - Run the next cell for automated R setup")
                else:
                    print("   - On Windows: May need Visual Studio Build Tools")
                    print("   - Try: conda install -c conda-forge rpy2")
                    print("   - Or: pip install rpy2 (requires R to be installed)")
        else:
            print("✓ All packages ready")
        
        # Mount Google Drive if needed (check if DATA_SOURCE exists)
        try:
            if globals().get('DATA_SOURCE') == "drive" and self.is_colab:
                self.mount_drive()
        except NameError:
            pass  # DATA_SOURCE not defined yet
    
    def mount_drive(self):
        """Mount Google Drive in Colab"""
        try:
            from google.colab import drive
            drive.mount('/content/drive')
            print("✓ Google Drive mounted")
        except Exception as e:
            print(f"❌ Failed to mount Google Drive: {e}")

# Initialize environment
env_manager = EnvironmentManager()

Environment: Local/Jupyter
Installing/checking packages...
Installing pandas...
Installing requests...
Installing rpy2...
Installing google.cloud.storage...
✓ All packages ready
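
The detection logic in `detect_environment` reads two signals: Colab is inferred from the `COLAB_GPU` environment variable or a loaded `google.colab` module, and Vertex AI from `DL_ANACONDA_HOME`. A side-effect-free version of the same checks is easier to test in isolation. The sketch below assumes a hypothetical `classify_environment` function that takes the environment and module tables as arguments.

```python
import os
import sys


def classify_environment(environ=None, modules=None):
    """Mirror the notebook's checks without printing or installing anything."""
    environ = os.environ if environ is None else environ
    modules = sys.modules if modules is None else modules

    if "COLAB_GPU" in environ or "google.colab" in modules:
        return "Google Colab"
    if "DL_ANACONDA_HOME" in environ:
        return "Vertex AI"
    return "Local/Jupyter"


print(classify_environment())
```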


If you are running in Vertex AI Workbench and the previous cell could not install rpy2, run the next cell to install R and rpy2, then re-run the previous cell.

Otherwise, skip the next cell.

In [None]:
# Vertex AI Workbench: R and rpy2 Installation Helper
def setup_r_for_vertex_ai():
    """Install R and rpy2 on Vertex AI Workbench"""
    print("Setting up R environment for Vertex AI...")
    
    # Check if R is installed
    try:
        subprocess.check_call(['R', '--version'], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        print("✓ R is already installed")
        r_installed = True
    except (subprocess.CalledProcessError, FileNotFoundError):
        print("❌ R not found - installing...")
        r_installed = False
    
    # Install R if not present
    if not r_installed:
        try:
            # Update package list
            subprocess.check_call(['sudo', 'apt-get', 'update'], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            
            # Install R
            subprocess.check_call(['sudo', 'apt-get', 'install', '-y', 'r-base', 'r-base-dev'], 
                                stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            
            # Install development tools needed for rpy2
            subprocess.check_call(['sudo', 'apt-get', 'install', '-y', 'build-essential', 'libcurl4-openssl-dev', 
                                 'libssl-dev', 'libxml2-dev'], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            
            print("✓ R installed successfully")
            
        except subprocess.CalledProcessError as e:
            print(f"❌ Failed to install R: {e}")
            return False
    
    # Now try to install rpy2
    print("Installing rpy2...")
    methods = [
        ([sys.executable, '-m', 'pip', 'install', '--upgrade', 'pip', 'setuptools', 'wheel'], "upgrade pip tools"),
        ([sys.executable, '-m', 'pip', 'install', 'rpy2'], "pip install rpy2"),
        (['conda', 'install', '-c', 'conda-forge', 'rpy2', '-y'], "conda-forge rpy2"),
        ([sys.executable, '-m', 'pip', 'install', '--no-cache-dir', '--force-reinstall', 'rpy2'], "force reinstall rpy2"),
    ]
    
    for cmd, description in methods:
        try:
            print(f"  Trying: {description}")
            subprocess.check_call(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            print(f"  ✓ Success: {description}")
            
            # Test if rpy2 can be imported
            try:
                import rpy2
                print("✓ rpy2 installed and importable")
                return True
            except ImportError:
                print(f"  ⚠️ {description} completed but rpy2 still not importable")
                continue
                
        except (subprocess.CalledProcessError, FileNotFoundError):
            print(f"  ✗ Failed: {description}")
            continue
    
    print("❌ All rpy2 installation methods failed")
    return False

def install_rpy2_alternatives():
    """Try different methods to install rpy2 with environment-specific handling"""
    print("Attempting rpy2 installation...")
    
    # Check if we're on Vertex AI and try specialized setup
    if env_manager.is_vertex:
        print("Detected Vertex AI - using specialized R setup...")
        if setup_r_for_vertex_ai():
            return True
    
    # Standard installation methods
    print("Trying standard installation methods...")
    methods = [
        ([sys.executable, '-m', 'pip', 'install', '--upgrade', 'pip'], "upgrade pip"),
        ([sys.executable, '-m', 'pip', 'install', 'rpy2'], "pip install rpy2"),
        (['conda', 'install', '-c', 'conda-forge', 'rpy2', '-y'], "conda-forge rpy2"),
        ([sys.executable, '-m', 'pip', 'install', '--no-cache-dir', 'rpy2'], "pip no-cache rpy2"),
    ]
    
    for cmd, description in methods:
        try:
            print(f"Trying: {description}")
            subprocess.check_call(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            
            # Test import
            try:
                import rpy2
                print("✓ Success!")
                return True
            except ImportError:
                continue
                
        except (subprocess.CalledProcessError, FileNotFoundError):
            print(f"✗ Failed: {description}")
            continue
    
    print("❌ All rpy2 installation methods failed")
    print("\n💡 Manual solutions:")
    
    if env_manager.is_vertex:
        print("For Vertex AI Workbench:")
        print("   1. Open a terminal in JupyterLab")
        print("   2. Run: sudo apt-get update")
        print("   3. Run: sudo apt-get install -y r-base r-base-dev build-essential")
        print("   4. Run: pip install rpy2")
        print("   5. Restart the Python kernel")
    else:
        print("General solutions:")
        print("   1. Ensure R is installed on your system")
        print("   2. On Windows: install Visual Studio Build Tools")
        print("   3. Try: pip install --upgrade pip setuptools wheel")
        print("   4. Then: pip install rpy2")
    
    return False

# Run the installation if rpy2 failed in the main setup
if 'rpy2' in globals().get('failed_packages', []):
    print("\n" + "="*50)
    print("ATTEMPTING rpy2 RECOVERY")
    print("="*50)
    install_rpy2_alternatives()
    print("="*50)

### Manual R Installation for Vertex AI Workbench

If automatic installation still fails, you can manually install R and rpy2 by opening a **Terminal** in JupyterLab and running these commands:

```bash
# Update package list
sudo apt-get update

# Install R and development tools
sudo apt-get install -y r-base r-base-dev build-essential libcurl4-openssl-dev libssl-dev libxml2-dev

# Install rpy2 
pip install --upgrade pip setuptools wheel
pip install rpy2

# Verify installation
python -c "import rpy2; print('rpy2 installed successfully')"
```

After running these commands:
1. **Restart the Python kernel** (Kernel → Restart Kernel)
2. Re-run the environment setup cells
3. Continue with the analysis

---

## 3. Data Loading

A configuration-driven data loader with error handling.

In [6]:
import pandas as pd
import requests
from pathlib import Path

class DataLoader:
    def __init__(self):
        self.base_path = Path('/content') if env_manager.is_colab else Path.home()
        self.data_dir = self.base_path / 'data'
        self.data_dir.mkdir(exist_ok=True)
    
    def load_data(self):
        """Load data based on configuration"""
        loaders = {
            'github': self._load_github,
            'local': self._load_local,
            'gcs': self._load_gcs,
            'drive': self._load_drive
        }
        
        if DATA_SOURCE not in loaders:
            raise ValueError(f"Invalid DATA_SOURCE: {DATA_SOURCE}")
        
        print(f"Loading data from: {DATA_SOURCE.upper()}")
        return loaders[DATA_SOURCE]()
    
    def _load_github(self):
        """Load from GitHub releases"""
        response = requests.get(GITHUB_URL, timeout=60)
        response.raise_for_status()
        
        data_path = self.data_dir / "nass_data.csv"
        data_path.write_bytes(response.content)
        
        print(f"✓ Downloaded from GitHub ({response.headers.get('content-length', 'unknown')} bytes)")
        return pd.read_csv(data_path)
    
    def _load_local(self):
        """Load from local file"""
        search_paths = [
            self.base_path / LOCAL_FILENAME,
            self.data_dir / LOCAL_FILENAME,
            Path.cwd() / LOCAL_FILENAME
        ]
        
        for path in search_paths:
            if path.exists():
                print(f"✓ Found local file: {path}")
                return pd.read_csv(path)
        
        raise FileNotFoundError(f"File not found in: {[str(p) for p in search_paths]}")
    
    def _load_gcs(self):
        """Load from Google Cloud Storage"""
        from google.cloud import storage
        
        # Smart authentication
        if Path(GCS_SERVICE_ACCOUNT_KEY).exists():
            client = storage.Client.from_service_account_json(GCS_SERVICE_ACCOUNT_KEY)
        else:
            client = storage.Client()  # Use default credentials
        
        bucket = client.bucket(GCS_BUCKET)
        blob = bucket.blob(GCS_BLOB)
        
        data_path = self.data_dir / "nass_data.csv"
        blob.download_to_filename(data_path)
        
        print(f"✓ Downloaded from GCS: {GCS_BUCKET}/{GCS_BLOB}")
        return pd.read_csv(data_path)
    
    def _load_drive(self):
        """Load from Google Drive (Colab only)"""
        if not env_manager.is_colab:
            raise RuntimeError("Drive loading only available in Google Colab")
        
        drive_path = Path(DRIVE_PATH)
        if not drive_path.exists():
            raise FileNotFoundError(f"Drive file not found: {DRIVE_PATH}")
        
        print(f"✓ Loading from Google Drive: {DRIVE_PATH}")
        return pd.read_csv(drive_path)

# Load data
try:
    loader = DataLoader()
    df = loader.load_data()
    
    print(f"✅ Data loaded successfully!")
    print(f"   Shape: {df.shape}")
    print(f"   Memory: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
    
    if VERBOSE_PRINTS:
        print(f"\nColumns: {list(df.columns)}")
        print(f"\nFirst 3 rows:")
        print(df.head(3))
        
except Exception as e:
    print(f"❌ Data loading failed: {e}")
    print(f"💡 Try changing DATA_SOURCE or check file paths")
    raise

Loading data from: GITHUB
✓ Downloaded from GitHub (213436933 bytes)
✅ Data loaded successfully!
   Shape: (139233, 675)
   Memory: 885.2 MB

Columns: ['KEY_NASS', 'HOSP_NASS', 'HOSP_TEACH', 'HOSP_LOCATION', 'HOSP_LOCTEACH', 'HOSP_REGION', 'HOSP_BEDSIZE_CAT', 'DISCWT', 'NASS_STRATUM', 'N_DISC_U', 'N_HOSP_U', 'S_DISC_U', 'S_HOSP_U', 'TOTAL_AS_ENCOUNTERS', 'YEAR', 'AGE', 'FEMALE', 'PL_NCHS', 'ZIPINC_QRTL', 'AMONTH', 'AWEEKEND', 'DQTR', 'PAY1', 'DISPUNIFORM', 'TOTCHG', 'NCPT_INSCOPE', 'CPTCCS1', 'CPTCCS2', 'CPTCCS3', 'CPTCCS4', 'CPTCCS5', 'CPTCCS6', 'CPTCCS7', 'CPTCCS8', 'CPTCCS9', 'CPTCCS10', 'CPTCCS11', 'CPTCCS12', 'CPTCCS13', 'CPTCCS14', 'CPTCCS15', 'CPTCCS16', 'CPTCCS17', 'CPTCCS18', 'CPTCCS19', 'CPTCCS20', 'CPTCCS21', 'CPTCCS22', 'CPTCCS23', 'CPTCCS24', 'CPTCCS25', 'CPTCCS26', 'CPTCCS27', 'CPTCCS28', 'CPTCCS29', 'CPTCCS30', 'CPT1', 'CPT2', 'CPT3', 'CPT4', 'CPT5', 'CPT6', 'CPT7', 'CPT8', 'CPT9', 'CPT10', 'CPT11', 'CPT12', 'CPT13', 'CPT14', 'CPT15', 'CPT16', 'CPT17', 'CPT18', 'CPT19', 'CPT20', 'CPT21', 'CPT22', 'CPT23', 'CPT24', 'CPT25', 'CPT26', 'CPT27
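
`_load_github` holds the whole response in memory (`response.content`) before writing it to disk. That is workable for the roughly 213 MB download reported above, but it raises peak memory use. One alternative is to stream the download in fixed-size chunks. The sketch below uses the standard library's `urllib.request` rather than `requests` so that it is self-contained; `stream_to_file` and its `chunk_size` default are assumptions, not part of the notebook.

```python
import shutil
import urllib.request
from pathlib import Path


def stream_to_file(url, dest, chunk_size=1 << 20, timeout=60):
    """Copy a URL to disk in chunks and return the number of bytes written."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(dest, "wb") as out:
        # copyfileobj reads at most chunk_size bytes at a time
        shutil.copyfileobj(response, out, length=chunk_size)
    return dest.stat().st_size
```

With `requests`, the comparable pattern is `requests.get(url, stream=True)` followed by a loop over `response.iter_content(chunk_size=...)`.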

---
## 4. R Environment Setup

Load the rpy2 IPython extension and install any missing R packages.

In [7]:
# Load rpy2 extension for R integration
try:
    %load_ext rpy2.ipython
    print("✓ R integration loaded")
except Exception as e:
    print(f"❌ Failed to load R integration: {e}")
    print("Install rpy2: pip install rpy2")
    raise

Error importing in API mode: ImportError('On Windows, cffi mode "ANY" is only "ABI".')
Trying to import in ABI mode.


✓ R integration loaded


In [None]:
%%R -i VERBOSE_PRINTS

# Smart R package installation
required_packages <- c("data.table", "survey", "dplyr", "tidyverse", 
                      "tidycensus", "ggplot2", "gridExtra", "pROC", 
                      "broom", "lme4")

# Check what's already installed
installed_packages <- rownames(installed.packages())
missing_packages <- required_packages[!required_packages %in% installed_packages]

# Install missing packages
if(length(missing_packages) > 0) {
  cat("Installing R packages:", paste(missing_packages, collapse=", "), "\n")
  install.packages(missing_packages, repos="https://cloud.r-project.org", 
                   quiet=!VERBOSE_PRINTS, dependencies=TRUE)
}

# Load all packages; require() returns TRUE/FALSE, whereas library() does not
success <- sapply(required_packages, function(pkg) {
  suppressMessages(suppressWarnings(
    require(pkg, character.only = TRUE, quietly = TRUE)))
})

cat("✓ R packages loaded:", sum(success), "/", length(required_packages), "\n")
if(any(!success)) {
  cat("⚠️  Failed to load:", paste(names(success)[!success], collapse=", "), "\n")
}

In [None]:
%%R -i df -i VERBOSE_PRINTS
options(datatable.print.nrows = 10)

NASS <- as.data.table(df)
if (VERBOSE_PRINTS) print(NASS[1:10])

# Light type coercion
num_cols  <- c("AGE","DISCWT","TOTCHG","TOTAL_AS_ENCOUNTERS")
NASS[, (num_cols) := lapply(.SD, as.numeric), .SDcols = num_cols]

# Indicator: 1 when RACE == 1 (White in HCUP coding), 0 otherwise; missing RACE stays NA
NASS[, WHITE := fifelse(RACE == 1, 1, 0)]

In [None]:
%%R
cat("Rows:", nrow(NASS), "  Cols:", ncol(NASS), "\n")
top10 <- NASS[, .N, by = CPTCCS1][order(-N)][1:10]
knitr::kable(top10, caption = "Top 10 CPTCCS1 counts (simulated)")

---
## 9. R Analysis - Income Quartile vs Procedure

Visualize income distribution within the most common procedures.

In [None]:
%%R
top_codes <- top10$CPTCCS1
# Treat the procedure code and income quartile as categorical for plotting
plt_income <- NASS[CPTCCS1 %in% top_codes] |>
  ggplot(aes(x = fct_infreq(as.character(CPTCCS1)), fill = factor(ZIPINC_QRTL))) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  coord_flip() +
  labs(y = "Share within CPT", x = "CPTCCS1", fill = "ZIP Quartile",
       title = "Income distribution within 10 most-common procedures")
print(plt_income)

---
## 10. Census API Setup

Store the Census API key in an environment variable so that the R cells can read it.

In [None]:
import getpass
import os
os.environ["CENSUS_API_KEY"] = getpass.getpass("Enter your Census API key (will not echo):")
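
Prompting on every run is unnecessary when the key is already exported in the shell. A small sketch of a prompt-only-if-missing pattern follows. `ensure_census_key` is a hypothetical helper, and its `prompt` parameter exists only so the function can be exercised without a terminal.

```python
import getpass
import os


def ensure_census_key(environ=None, prompt=getpass.getpass):
    """Return the Census API key, asking for it only when it is not already set."""
    environ = os.environ if environ is None else environ
    key = environ.get("CENSUS_API_KEY", "").strip()
    if not key:
        key = prompt("Enter your Census API key (will not echo): ").strip()
        environ["CENSUS_API_KEY"] = key
    return key
```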

---
## 11. R | Set Census Key & Pull 2020 DHC Totals

In [None]:
%%R -i VERBOSE_PRINTS
# Register the key for this R session only (install = FALSE leaves .Renviron untouched)
tidycensus::census_api_key(Sys.getenv("CENSUS_API_KEY"), overwrite = FALSE, install = FALSE)

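# P12 has 49 cells: _001N (total), _002N and _026N (male/female subtotals), and 46 age cells.
# Summing all 49 counts each person three times; the white/all ratio is unaffected.
get_vars <- function(base) sprintf("%s_%03dN", base, 1:49)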

vars_total <- get_vars("P12")
vars_white <- get_vars("P12I")

pull_state_totals <- function(vars){
  get_decennial(geography = "state",
                variables = vars,
                year = 2020, sumfile = "dhc") |>
  group_by(NAME) |> summarise(total = sum(value))
}

total_pop  <- pull_state_totals(vars_total)
white_pop  <- pull_state_totals(vars_white)

# merge() on tibbles returns a plain data frame; convert before using :=
census_prop <- as.data.table(merge(total_pop, white_pop, by = "NAME",
                                   suffixes = c("_all","_white")))
census_prop[, prop_white := total_white / total_all]

if (VERBOSE_PRINTS) print(head(census_prop))
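
The `sprintf` pattern in `get_vars` produces the 49 cell names of a DHC "sex by age" table. As a sketch of what that expands to, and of why summing every cell still yields a valid proportion, the Python below rebuilds the names and applies them to made-up counts. `table_vars`, `toy_counts`, and all numbers are illustrative assumptions, not Census data.

```python
def table_vars(base, n_cells=49):
    """Build variable names such as P12_001N ... P12_049N."""
    return [f"{base}_{i:03d}N" for i in range(1, n_cells + 1)]


def toy_counts(base, per_age_cell):
    """Made-up table: 23 age cells per sex, plus sex subtotals and a total."""
    names = table_vars(base)
    sex_total = 23 * per_age_cell
    counts = {name: per_age_cell for name in names}
    counts[names[0]] = 2 * sex_total   # _001N: total
    counts[names[1]] = sex_total       # _002N: male subtotal
    counts[names[25]] = sex_total      # _026N: female subtotal
    return counts


all_cells = toy_counts("P12", per_age_cell=100)
white_cells = toy_counts("P12I", per_age_cell=60)

# Summing all 49 cells counts each person three times in both tables,
# so the factor cancels in the ratio.
ratio_all_cells = sum(white_cells.values()) / sum(all_cells.values())
ratio_totals = white_cells["P12I_001N"] / all_cells["P12_001N"]
print(ratio_all_cells, ratio_totals)
```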

---
## 12. R | Weighted vs Unweighted Proportion Test

In [None]:
%%R
library(survey)

# Survey design using provided discharge weight
des <- svydesign(ids = ~1, weights = ~DISCWT, data = NASS)

unweighted_hat <- mean(NASS$WHITE, na.rm = TRUE)
weighted_hat   <- coef(svymean(~WHITE, des, na.rm = TRUE))[["WHITE"]]

us_prop <- weighted.mean(census_prop$prop_white,
                         w = census_prop$total_all)

cat(sprintf("Unweighted NASS white proportion: %.3f\n", unweighted_hat))
cat(sprintf("Weighted   NASS white proportion: %.3f\n", weighted_hat))
cat(sprintf("2020 Census (all states returned) white proportion: %.3f\n", us_prop))

svytest <- svyciprop(~WHITE, des,
                     method = "likelihood", level = 0.95)
print(svytest)
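
The comparison above contrasts a simple mean of `WHITE` with a mean in which each record counts `DISCWT` times. The arithmetic behind the point estimate can be sketched in a few lines of Python. The records and weights below are invented for illustration, and the sketch does not reproduce the variance estimation that the `survey` package performs.

```python
def unweighted_proportion(flags):
    """Share of records with flag == 1."""
    return sum(flags) / len(flags)


def weighted_proportion(flags, weights):
    """Share of the weighted total carried by records with flag == 1."""
    return sum(f * w for f, w in zip(flags, weights)) / sum(weights)


# Invented example: four sampled encounters with discharge-style weights
white = [1, 1, 0, 0]
discwt = [1.0, 1.0, 3.0, 5.0]

print(unweighted_proportion(white))
print(weighted_proportion(white, discwt))
```

In this toy sample the two records flagged 1 carry little weight, so the weighted share is lower than the unweighted one.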

---
## 13. R | Age-by-sex plot vs Census (adapted from `agesociodiv.r`)

In [None]:
%%R
age_breaks <- c(-Inf,4,9,14,17,19,20,21,24,29,34,39,44,49,54,59,61,64,
                66,69,74,79,84,Inf)
age_labels <- c("U5","5-9","10-14","15-17","18-19","20","21",
                "22-24","25-29","30-34","35-39","40-44","45-49",
                "50-54","55-59","60-61","62-64","65-66","67-69",
                "70-74","75-79","80-84","85+")

NASS[, AGE_GROUP := cut(AGE, breaks = age_breaks,
                        labels = age_labels, right = TRUE)]

plot_df <- NASS[, .(white = sum(WHITE),
                    n     = .N),
                by = .(SEX = factor(FEMALE, labels=c("Male","Female")),
                       AGE_GROUP)]
plot_df[, prop := white/n]

gg_gender <- ggplot(plot_df, aes(x = AGE_GROUP, y = prop,
                                 group = SEX, color = SEX)) +
  geom_line(linewidth=1) +
  geom_point() +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "% White (NASS, simulated)", x = "Age-group",
       title = "Crude white proportion by age & sex") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle=45, hjust=1))
print(gg_gender)
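
`cut(..., right = TRUE)` assigns each age to an interval that is closed on the right, so an age of 4 falls in `U5` and an age of 5 in `5-9`. The same rule can be sketched in Python with `bisect_left` over the finite upper bounds. `age_group` is a hypothetical helper; the bounds and labels are copied from the R cell.

```python
from bisect import bisect_left

# Finite upper bounds of each right-closed interval (the last group is open-ended)
UPPER_BOUNDS = [4, 9, 14, 17, 19, 20, 21, 24, 29, 34, 39, 44, 49,
                54, 59, 61, 64, 66, 69, 74, 79, 84]
AGE_LABELS = ["U5", "5-9", "10-14", "15-17", "18-19", "20", "21",
              "22-24", "25-29", "30-34", "35-39", "40-44", "45-49",
              "50-54", "55-59", "60-61", "62-64", "65-66", "67-69",
              "70-74", "75-79", "80-84", "85+"]


def age_group(age):
    """Return the label whose interval (lower, upper] contains age."""
    return AGE_LABELS[bisect_left(UPPER_BOUNDS, age)]


for example_age in (0, 4, 5, 20, 21, 84, 85):
    print(example_age, age_group(example_age))
```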

---
## 14. R | Multilevel logistic models (hospital nested, 3 tiers)

In [None]:
%%R
features <- NASS[, .(WHITE,
                     FEMALE,
                     ZIPINC_QRTL,
                     PAY1,
                     CPTCCS1,
                     HOSP_LOCATION,
                     HOSP_TEACH,
                     HOSP_NASS)]

features[, c(names(features)) := lapply(.SD, as.factor)]

formulas <- list(
  m1 = WHITE ~ FEMALE + (1|HOSP_NASS),
  m2 = WHITE ~ FEMALE + ZIPINC_QRTL + (1|HOSP_NASS),
  m3 = WHITE ~ FEMALE + ZIPINC_QRTL + PAY1 + CPTCCS1 +
                    HOSP_LOCATION + HOSP_TEACH + (1|HOSP_NASS)
)

fit <- lapply(formulas, glmer, family = binomial, data = features,
              control = glmerControl(optimizer="bobyqa", optCtrl=list(maxfun=2e4)))

# Fixed-effect estimates (up to the first five rows) from each model's lme4 summary
lapply(fit, function(m) head(coef(summary(m)), 5))
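
All three formulas share one structure: a logistic regression with a random intercept for each hospital (`HOSP_NASS`). As a sketch of the model form under the usual `glmer` assumptions, with $i$ indexing encounters, $j$ indexing hospitals, and $\mathbf{x}_{ij}$ standing for whichever fixed-effect covariates a given formula includes:

```latex
\operatorname{logit} \Pr(\mathrm{WHITE}_{ij} = 1)
  = \beta_0 + \mathbf{x}_{ij}^{\top} \boldsymbol{\beta} + u_j,
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2)
```

Moving from `m1` to `m3` only lengthens $\mathbf{x}_{ij}$; the hospital-level term $u_j$ is the same in each.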

---
## 15. R | Compare AUC across the three models

In [None]:
%%R
library(pROC)
auc_vals <- sapply(fit, function(m){
  preds <- predict(m, type="response")
  # Take the response from the model frame so it matches the rows glmer kept
  as.numeric(roc(model.frame(m)$WHITE, preds)$auc)
})
knitr::kable(data.frame(model = names(auc_vals), AUC = auc_vals),
             caption = "AUC (in-sample, simulated data)")
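
The AUC can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. A direct, quadratic-time Python sketch of that definition follows. `pairwise_auc` is a hypothetical name and the labels and scores are invented, so this illustrates the statistic rather than the `pROC` implementation.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```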

---
## 16. Python | Teardown Helper (Optional)

In [None]:
if DATA_SOURCE != "drive":
    print("Done ✅ In Colab, the downloaded CSV is removed when the runtime ends; elsewhere, delete it from the data directory manually.")
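
Outside Colab, nothing removes the downloaded file automatically. A minimal cleanup sketch is shown below. `remove_downloaded_csv` is a hypothetical helper, and its default filename matches the `nass_data.csv` name used by `DataLoader`.

```python
from pathlib import Path


def remove_downloaded_csv(data_dir, filename="nass_data.csv"):
    """Delete the cached CSV if present; return True when a file was removed."""
    target = Path(data_dir) / filename
    if target.is_file():
        target.unlink()
        return True
    return False
```

For example, `remove_downloaded_csv(Path.home() / "data")` targets the directory that `DataLoader` uses outside Colab.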