[![Open on GitHub](https://img.shields.io/badge/GitHub-View%20Source-181717?style=for-the-badge&logo=github)](https://github.com/SeenaKhosravi/NASS/blob/main/NASS_Analysis.ipynb)
[![Open In Colab](https://img.shields.io/badge/Colab-Open%20Notebook-F9AB00?style=for-the-badge&logo=google-colab)](https://colab.research.google.com/github/SeenaKhosravi/NASS/blob/main/NASS_Analysis.ipynb)
[![Open in Vertex AI](https://img.shields.io/badge/Vertex%20AI-Open%20Workbench-4285F4?style=for-the-badge&logo=google-cloud)](https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/SeenaKhosravi/NASS/main/NASS_Analysis.ipynb)


# Socioeconomic and Demographic Drivers of Ambulatory Surgery Usage
### HCUP NASS 2020 – Reproducible Pipeline (Python + R)

**Author:** Seena Khosravi, MD  
**Last Updated:** September 2, 2025  
**Data Source:** Healthcare Cost and Utilization Project (HCUP) NASS 2020

---

## Overview
This notebook provides a reproducible analysis pipeline for examining socioeconomic and demographic factors influencing ambulatory surgery usage patterns. The analysis combines Python for data processing and R for statistical modeling.

### Data Usage Agreement
**DUA Compliant:** By default this notebook runs on a simulated dataset, so no restricted HCUP data is needed. If you have purchased the full dataset from HCUP, you can load it from local or cloud storage instead.

### Key Features
- **Platform Agnostic:** Works in Jupyter, Colab, Vertex AI, or local environments
- **Flexible Data Sources:** GitHub (simulated), Google Cloud Storage, or local files
- **Reproducible:** All dependencies and environment setup included
- **Scalable:** Handles both the simulated dataset (1 GB, 700k rows) and the full dataset (12 GB, 7.8M rows)

---

## Design Notes

### Architecture
- **Python cells** handle "plumbing" (file I/O, environment setup, rpy2 configuration, data previews)
- **R cells** (prefixed by `%%R`) perform statistical analysis: survey weights, Census lookups, multilevel models, plots

### Data Sources
- **Default:** Simulated dataset (1 GB) from GitHub releases
- **Local:** Switch to locally stored files via configuration
- **Cloud:** Google Cloud Storage support for large datasets
- **Full Dataset:** 7.8M row HCUP release (requires DUA compliance)

### Environment Support
- Jupyter Notebook
- Google Colab
- Vertex AI Workbench
- Local Python/R environments

---
## 1. Configuration

Configure all settings in one place - data sources, debugging options, and file paths.

In [None]:
# ==================== CONFIGURATION ====================
# Data Source Options
DATA_SOURCE = "github"      # Options: "github", "local", "gcs", "drive"
VERBOSE_PRINTS = True       # False → suppress debug output

# GitHub source (default - simulated data)
GITHUB_URL = "https://github.com/SeenaKhosravi/NASS/releases/download/v1.0.0/nass_2020_simulated.csv"

# Local file options
LOCAL_FILENAME = "nass_2020_local.csv"

# Google Cloud Storage options
GCS_BUCKET = "nass_2020"
GCS_BLOB = "nass_2020_all.csv"
GCS_SERVICE_ACCOUNT_KEY = "/path/to/service-account-key.json"  # Optional

# Google Drive options (for Colab)
DRIVE_PATH = "/content/drive/MyDrive/NASS/nass_2020_full.csv"
# ======================================================

print("✓ Configuration loaded")
print(f"  Data source: {DATA_SOURCE}")
print(f"  Verbose mode: {VERBOSE_PRINTS}")
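
Because every later cell reads these constants, a quick sanity check can catch a typo in `DATA_SOURCE` before any download starts. The sketch below is an illustration only: `validate_config` and `VALID_SOURCES` are hypothetical names, not part of the notebook.

```python
# Hypothetical helper: fail fast on a mistyped data source.
VALID_SOURCES = ("github", "local", "gcs", "drive")  # mirrors the options listed above


def validate_config(data_source: str) -> str:
    """Return the normalized source name, or raise ValueError if it is unknown."""
    normalized = data_source.strip().lower()
    if normalized not in VALID_SOURCES:
        raise ValueError(
            f"Invalid DATA_SOURCE {data_source!r}; expected one of {VALID_SOURCES}"
        )
    return normalized


print(validate_config("GitHub"))  # github
```

Normalizing case and whitespace is a design choice made for the sketch; the notebook itself compares `DATA_SOURCE` exactly.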

---
## 2. Environment Setup & Package Installation

Detects the runtime (Colab, Vertex AI, or local Jupyter) and installs any Python packages that are missing.

In [None]:
import os
import sys
import subprocess
from pathlib import Path

class EnvironmentManager:
    def __init__(self):
        self.detect_environment()
        self.setup_packages()
    
    def detect_environment(self):
        """Detect runtime environment"""
        self.is_colab = 'COLAB_GPU' in os.environ or 'google.colab' in sys.modules
        self.is_vertex = 'DL_ANACONDA_HOME' in os.environ
        
        if self.is_colab:
            self.env_type = "Google Colab"
        elif self.is_vertex:
            self.env_type = "Vertex AI"
        else:
            self.env_type = "Local/Jupyter"
        
        print(f"Environment: {self.env_type}")
    
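    def install_package(self, package, install_name=None):
        """Import `package`; if missing, install `install_name` (conda first, then pip)."""
        try:
            __import__(package)
            return True
        except ImportError:
            pass

        # The import name and the distribution name can differ
        # (e.g. google.cloud.storage vs google-cloud-storage)
        install_name = install_name or package
        quiet = dict(stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        commands = [[sys.executable, '-m', 'pip', 'install', install_name]]
        if not self.is_colab:
            # Prefer conda where available; fall back to pip if it is missing or fails
            commands.insert(0, ['conda', 'install', '-y', install_name])

        for cmd in commands:
            try:
                subprocess.check_call(cmd, **quiet)
                return True
            except (subprocess.CalledProcessError, OSError):
                continue
        return False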
    
    def setup_packages(self):
        """Install required packages efficiently"""
        packages = {
            'pandas': None,
            'requests': None,
            'pathlib': None,  # Built-in, but check anyway
            'rpy2': 'rpy2',
            'google.cloud.storage': 'google-cloud-storage'
        }
        
        print("Installing/checking packages...")
        failed = []
        
        for pkg, install_name in packages.items():
            install_name = install_name or pkg
            if not self.install_package(pkg, install_name):
                failed.append(pkg)
        
        if failed:
            print(f"⚠️  Failed to install: {', '.join(failed)}")
            print("Some features may not work")
        else:
            print("✓ All packages ready")
        
        # Mount Google Drive if needed
        if DATA_SOURCE == "drive" and self.is_colab:
            self.mount_drive()
    
    def mount_drive(self):
        """Mount Google Drive in Colab"""
        try:
            from google.colab import drive
            drive.mount('/content/drive')
            print("✓ Google Drive mounted")
        except Exception as e:
            print(f"❌ Failed to mount Google Drive: {e}")

# Initialize environment
env_manager = EnvironmentManager()
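
It can be useful to see which packages are missing before anything is installed. A minimal sketch, assuming a mapping from import name to install name like the one in `setup_packages`; `missing_packages` is a hypothetical helper and the package name in the demo is invented.

```python
import importlib.util


def missing_packages(import_to_install: dict) -> list:
    """Return install names for modules that cannot be imported."""
    missing = []
    for import_name, install_name in import_to_install.items():
        try:
            found = importlib.util.find_spec(import_name) is not None
        except (ImportError, ValueError):
            # find_spec raises when a parent package (e.g. "google") is absent
            found = False
        if not found:
            missing.append(install_name or import_name)
    return missing


print(missing_packages({"json": None, "hypothetical_pkg_xyz": "hypothetical-pkg-xyz"}))
```

Unlike `__import__`, `find_spec` does not execute the target module, so the check stays cheap for heavy packages.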

---
## 3. Data Loading

A single `DataLoader` class dispatches on `DATA_SOURCE` and returns a pandas DataFrame. Any failure is reported with a hint and then re-raised.

In [None]:
import pandas as pd
import requests
from pathlib import Path

class DataLoader:
    def __init__(self):
        self.base_path = Path('/content') if env_manager.is_colab else Path.home()
        self.data_dir = self.base_path / 'data'
        self.data_dir.mkdir(exist_ok=True)
    
    def load_data(self):
        """Load data based on configuration"""
        loaders = {
            'github': self._load_github,
            'local': self._load_local,
            'gcs': self._load_gcs,
            'drive': self._load_drive
        }
        
        if DATA_SOURCE not in loaders:
            raise ValueError(f"Invalid DATA_SOURCE: {DATA_SOURCE}")
        
        print(f"Loading data from: {DATA_SOURCE.upper()}")
        return loaders[DATA_SOURCE]()
    
    def _load_github(self):
        """Load from GitHub releases"""
        data_path = self.data_dir / "nass_data.csv"

        # Stream to disk: the simulated file is ~1 GB, too large to buffer in memory
        with requests.get(GITHUB_URL, stream=True, timeout=60) as response:
            response.raise_for_status()
            with open(data_path, "wb") as f:
                for chunk in response.iter_content(chunk_size=1 << 20):
                    f.write(chunk)

        print(f"✓ Downloaded from GitHub ({data_path.stat().st_size:,} bytes)")
        return pd.read_csv(data_path)
    
    def _load_local(self):
        """Load from local file"""
        search_paths = [
            self.base_path / LOCAL_FILENAME,
            self.data_dir / LOCAL_FILENAME,
            Path.cwd() / LOCAL_FILENAME
        ]
        
        for path in search_paths:
            if path.exists():
                print(f"✓ Found local file: {path}")
                return pd.read_csv(path)
        
        raise FileNotFoundError(f"File not found in: {[str(p) for p in search_paths]}")
    
    def _load_gcs(self):
        """Load from Google Cloud Storage"""
        from google.cloud import storage
        
        # Smart authentication
        if Path(GCS_SERVICE_ACCOUNT_KEY).exists():
            client = storage.Client.from_service_account_json(GCS_SERVICE_ACCOUNT_KEY)
        else:
            client = storage.Client()  # Use default credentials
        
        bucket = client.bucket(GCS_BUCKET)
        blob = bucket.blob(GCS_BLOB)
        
        data_path = self.data_dir / "nass_data.csv"
        blob.download_to_filename(data_path)
        
        print(f"✓ Downloaded from GCS: {GCS_BUCKET}/{GCS_BLOB}")
        return pd.read_csv(data_path)
    
    def _load_drive(self):
        """Load from Google Drive (Colab only)"""
        if not env_manager.is_colab:
            raise RuntimeError("Drive loading only available in Google Colab")
        
        drive_path = Path(DRIVE_PATH)
        if not drive_path.exists():
            raise FileNotFoundError(f"Drive file not found: {DRIVE_PATH}")
        
        print(f"✓ Loading from Google Drive: {DRIVE_PATH}")
        return pd.read_csv(drive_path)

# Load data
try:
    loader = DataLoader()
    df = loader.load_data()
    
    print(f"✅ Data loaded successfully!")
    print(f"   Shape: {df.shape}")
    print(f"   Memory: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
    
    if VERBOSE_PRINTS:
        print(f"\nColumns: {list(df.columns)}")
        print(f"\nFirst 3 rows:")
        print(df.head(3))
        
except Exception as e:
    print(f"❌ Data loading failed: {e}")
    print(f"💡 Try changing DATA_SOURCE or check file paths")
    raise
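
The loader reads the whole CSV into a DataFrame, which suits the simulated file but may strain memory with the full 12 GB release. As a rough illustration of an alternative, the sketch below tallies one column row by row using only the standard library. `count_by_column` is a hypothetical helper and the sample values are made up.

```python
import csv
import io
from collections import Counter


def count_by_column(lines, column: str) -> Counter:
    """Tally values of one column while reading the CSV row by row."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row[column]] += 1
    return counts


# Tiny in-memory stand-in for the CSV; a real call would pass an open file handle
sample = io.StringIO("CPTCCS1,AGE\n15,70\n160,45\n15,66\n")
print(count_by_column(sample, "CPTCCS1").most_common(1))
```

Only one row is held in memory at a time, at the cost of values arriving as strings rather than typed columns.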

---
## 4. R Environment Setup

Load R integration and install R packages efficiently.

In [None]:
# Load rpy2 extension for R integration
try:
    %load_ext rpy2.ipython
    print("✓ R integration loaded")
except Exception as e:
    print(f"❌ Failed to load R integration: {e}")
    print("Install rpy2: pip install rpy2")
    raise

In [None]:
%%R -i VERBOSE_PRINTS

# Smart R package installation
required_packages <- c("data.table", "survey", "dplyr", "tidyverse", 
                      "tidycensus", "ggplot2", "gridExtra", "pROC", 
                      "broom", "lme4", "knitr")  # knitr provides kable() used below

# Check what's already installed
installed_packages <- rownames(installed.packages())
missing_packages <- required_packages[!required_packages %in% installed_packages]

# Install missing packages
if(length(missing_packages) > 0) {
  cat("Installing R packages:", paste(missing_packages, collapse=", "), "\n")
  install.packages(missing_packages, repos="https://cloud.r-project.org", 
                   quiet=!VERBOSE_PRINTS, dependencies=TRUE)
}

# Load all packages (require() returns TRUE/FALSE, whereas library() returns package names)
success <- sapply(required_packages, function(pkg) {
  suppressMessages(suppressWarnings(require(pkg, character.only=TRUE, quietly=TRUE)))
})

cat("✓ R packages loaded:", sum(success), "/", length(required_packages), "\n")
if(any(!success)) {
  cat("⚠️  Failed to load:", paste(names(success)[!success], collapse=", "), "\n")
}

In [None]:
%%R -i df -i VERBOSE_PRINTS
options(datatable.print.nrows = 10)

NASS <- as.data.table(df)
if (VERBOSE_PRINTS) print(NASS[1:10])

# Light type coercion
num_cols  <- c("AGE","DISCWT","TOTCHG","TOTAL_AS_ENCOUNTERS")
NASS[, (num_cols) := lapply(.SD, as.numeric), .SDcols = num_cols]

# Boolean helper
NASS[, WHITE := fifelse(RACE == 1, 1, 0)]

In [None]:
%%R
cat("Rows:", nrow(NASS), "  Cols:", ncol(NASS), "\n")
top10 <- NASS[, .N, by = CPTCCS1][order(-N)][1:10]
knitr::kable(top10, caption = "Top 10 CPTCCS1 counts (simulated)")

---
## 9. R Analysis - Income Quartile vs Procedure

Visualize income distribution within the most common procedures.

In [None]:
%%R
top_codes <- top10$CPTCCS1
plt_income <- NASS[CPTCCS1 %in% top_codes] |>
  ggplot(aes(x = fct_infreq(as.factor(CPTCCS1)), fill = as.factor(ZIPINC_QRTL))) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  coord_flip() +
  labs(y = "Share within CPT", x = "CPTCCS1", fill = "ZIP Quartile",
       title = "Income distribution within 10 most-common procedures")
print(plt_income)

---
## 10. Census API Setup

Set up environment variable for Census API key.

In [None]:
import getpass, os
if not os.environ.get("CENSUS_API_KEY"):  # prompt only when the key is not already set
    os.environ["CENSUS_API_KEY"] = getpass.getpass("Enter your Census API key (will not echo): ")

---
## 11. R | Set Census Key & Pull 2020 DHC Totals

In [None]:
%%R -i VERBOSE_PRINTS
# Register the key for this R session only (install = FALSE leaves .Renviron untouched)
tidycensus::census_api_key(Sys.getenv("CENSUS_API_KEY"), overwrite = FALSE, install = FALSE)

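# Cells 001 (total), 002 (male) and 026 (female) are subtotals; keep only the
# age-by-sex cells so that summing them counts each person once
get_vars <- function(base) sprintf("%s_%03dN", base, setdiff(1:49, c(1, 2, 26)))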

vars_total <- get_vars("P12")
vars_white <- get_vars("P12I")

pull_state_totals <- function(vars){
  get_decennial(geography = "state",
                variables = vars,
                year = 2020, sumfile = "dhc") |>
  group_by(NAME) |> summarise(total = sum(value))
}

total_pop  <- pull_state_totals(vars_total)
white_pop  <- pull_state_totals(vars_white)

# merge() on tibbles returns a data.frame; convert so that := works below
census_prop <- as.data.table(merge(total_pop, white_pop, by = "NAME",
                                   suffixes = c("_all","_white")))
census_prop[, prop_white := total_white / total_all]

if (VERBOSE_PRINTS) print(head(census_prop))
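
The variable names requested from the DHC file follow a fixed pattern: table id, a three-digit cell number, and an `N` suffix. The sketch below rebuilds that pattern in Python and shows the proportion step on invented counts. `dhc_vars` is a hypothetical helper, and the table layout (cell 1 is the total, cells 2 and 26 are the male and female subtotals) is an assumption that should be checked against the Census variable list.

```python
def dhc_vars(base: str, cells) -> list:
    """Build DHC variable names such as 'P12_003N' from a table id and cell numbers."""
    return [f"{base}_{cell:03d}N" for cell in cells]


# ASSUMPTION: cells 1, 2 and 26 are the total, male and female subtotals
age_sex_cells = [c for c in range(1, 50) if c not in (1, 2, 26)]
names = dhc_vars("P12", age_sex_cells)
print(len(names), names[0], names[-1])

# Toy counts (invented, not Census data) to show the proportion step
total_all, total_white = 1_000, 600
print(total_white / total_all)
```

Under that assumed layout, summing all 49 cells would count every person three times. The white-to-total ratio is unaffected because both tables are inflated by the same factor, but the state totals themselves would be too large.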

---
## 12. R | Weighted vs Unweighted Proportion Test

In [None]:
%%R
library(survey)

# Survey design using provided discharge weight
des <- svydesign(ids = ~1, weights = ~DISCWT, data = NASS)

unweighted_hat <- mean(NASS$WHITE, na.rm = TRUE)
weighted_hat   <- coef(svymean(~WHITE, des, na.rm = TRUE))[[1]]

# Population-weighted white share across every state returned by the Census pull
us_prop <- weighted.mean(census_prop$prop_white,
                         w = census_prop$total_all)

cat(sprintf("Unweighted NASS white proportion: %.3f\n", unweighted_hat))
cat(sprintf("Weighted   NASS white proportion: %.3f\n", weighted_hat))
cat(sprintf("2020 Census white proportion (all states pulled): %.3f\n", us_prop))

svytest <- svyciprop(~WHITE, des,
                     method = "likelihood", level = 0.95)
print(svytest)
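
The gap between the weighted and unweighted estimates comes down to simple arithmetic: each record contributes in proportion to its discharge weight. A minimal sketch of that point estimate, using invented records rather than NASS data; `weighted_proportion` is a hypothetical helper and does not reproduce the variance estimation done by the `survey` package.

```python
def weighted_proportion(flags, weights):
    """Weighted share of 1s: sum(w * y) / sum(w)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return sum(w * y for y, w in zip(flags, weights)) / total_weight


# Invented records: WHITE flag and a discharge weight per encounter
white = [1, 1, 0, 0]
discwt = [1.0, 1.0, 3.0, 3.0]

print(sum(white) / len(white))             # unweighted: 0.5
print(weighted_proportion(white, discwt))  # weighted: 0.25
```

Here the two non-white records carry three times the weight, so the weighted share drops from one half to one quarter.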

---
## 13. R | Age-by-sex plot vs Census (adapted from `agesociodiv.r`)

In [None]:
%%R
age_breaks <- c(-Inf,4,9,14,17,19,20,21,24,29,34,39,44,49,54,59,61,64,
                66,69,74,79,84,Inf)
age_labels <- c("U5","5-9","10-14","15-17","18-19","20","21",
                "22-24","25-29","30-34","35-39","40-44","45-49",
                "50-54","55-59","60-61","62-64","65-66","67-69",
                "70-74","75-79","80-84","85+")

NASS[, AGE_GROUP := cut(AGE, breaks = age_breaks,
                        labels = age_labels, right = TRUE)]

plot_df <- NASS[, .(white = sum(WHITE),
                    n     = .N),
                by = .(SEX = factor(FEMALE, labels=c("Male","Female")),
                       AGE_GROUP)]
plot_df[, prop := white/n]

gg_gender <- ggplot(plot_df, aes(x = AGE_GROUP, y = prop,
                                 group = SEX, color = SEX)) +
  geom_line(linewidth=1) +
  geom_point() +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "% White (NASS, simulated)", x = "Age-group",
       title = "Crude white proportion by age & sex") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle=45, hjust=1))
print(gg_gender)
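
The `cut(..., right = TRUE)` call assigns each age to the interval whose upper bound is the first break at or above it. The same rule can be sketched in Python with `bisect_left`; `age_group` is a hypothetical helper, and the bounds and labels are copied from the R cell.

```python
from bisect import bisect_left

# Inclusive upper bounds of every group except the open-ended last one
UPPER = [4, 9, 14, 17, 19, 20, 21, 24, 29, 34, 39, 44, 49, 54, 59, 61, 64,
         66, 69, 74, 79, 84]
LABELS = ["U5", "5-9", "10-14", "15-17", "18-19", "20", "21",
          "22-24", "25-29", "30-34", "35-39", "40-44", "45-49",
          "50-54", "55-59", "60-61", "62-64", "65-66", "67-69",
          "70-74", "75-79", "80-84", "85+"]


def age_group(age: float) -> str:
    """Return the label of the right-closed interval that contains age."""
    return LABELS[bisect_left(UPPER, age)]


for age in (4, 5, 20, 62, 85):
    print(age, age_group(age))
```

An age equal to a break falls in the lower group (4 maps to `U5`, 5 to `5-9`), which matches right-closed intervals.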

---
## 14. R | Multilevel logistic models (hospital nested, 3 tiers)

In [None]:
%%R
features <- NASS[, .(WHITE,
                     FEMALE,
                     ZIPINC_QRTL,
                     PAY1,
                     CPTCCS1,
                     HOSP_LOCATION,
                     HOSP_TEACH,
                     HOSP_NASS)]

features[, c(names(features)) := lapply(.SD, as.factor)]

formulas <- list(
  m1 = WHITE ~ FEMALE + (1|HOSP_NASS),
  m2 = WHITE ~ FEMALE + ZIPINC_QRTL + (1|HOSP_NASS),
  m3 = WHITE ~ FEMALE + ZIPINC_QRTL + PAY1 + CPTCCS1 +
                    HOSP_LOCATION + HOSP_TEACH + (1|HOSP_NASS)
)

fit <- lapply(formulas, glmer, family = binomial, data = features,
              control = glmerControl(optimizer="bobyqa", optCtrl=list(maxfun=2e4)))

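# First few fixed-effect estimates per model (log-odds scale); lme4's own summary
# is used because broom does not ship tidiers for mixed models
lapply(fit, function(m) head(coef(summary(m)), 5))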
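
Fixed-effect estimates from a logistic model are on the log-odds scale, so they usually need converting before they can be read. The sketch below shows the two standard conversions with invented coefficient values, not results from these models; `odds_ratio` and `inv_logit` are hypothetical helper names.

```python
import math


def odds_ratio(log_odds: float) -> float:
    """Convert a log-odds coefficient to an odds ratio."""
    return math.exp(log_odds)


def inv_logit(x: float) -> float:
    """Map a linear predictor on the log-odds scale to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Invented values for illustration only
intercept, beta_female = -0.5, 0.2

print(round(odds_ratio(beta_female), 3))             # odds ratio for FEMALE
print(round(inv_logit(intercept), 3))                # baseline probability
print(round(inv_logit(intercept + beta_female), 3))  # probability when FEMALE = 1
```

In a multilevel model these probabilities are conditional on a hospital with a random intercept of zero, so they describe a typical hospital rather than the population average.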

---
## 15. R | Compare AUC across the three models

In [None]:
%%R
library(pROC)
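auc_vals <- sapply(fit, function(m){
  preds <- predict(m, type="response")
  # Take the response from the fitted model frame so rows dropped for NAs stay aligned
  as.numeric(roc(model.frame(m)$WHITE, preds, quiet = TRUE)$auc)
})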
knitr::kable(data.frame(model = names(auc_vals), AUC = auc_vals),
             caption = "AUC (in-sample, simulated data)")
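
AUC has a direct interpretation: the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative one. A minimal sketch of that definition on invented scores; `auc` here is a hypothetical pure-Python function, not the `pROC` implementation.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented labels and predicted probabilities
labels = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.6, 0.6]
print(auc(labels, scores))  # 0.875
```

The pairwise loop is quadratic in the number of rows, so it serves as an explanation of the metric rather than something to run on the full dataset.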

---
## 16. Python | Teardown Helper (Optional)

In [None]:
if DATA_SOURCE != "drive":
    print("Done ✅ Runtime will auto-delete downloaded CSV when session ends.")
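
On a local machine the downloaded CSV persists after the session, so an explicit cleanup step may be wanted. A minimal sketch, demonstrated on a throwaway file; `remove_download` is a hypothetical helper and is not part of the notebook.

```python
import tempfile
from pathlib import Path


def remove_download(path: Path) -> bool:
    """Delete a downloaded file if it exists; return True when something was removed."""
    if path.is_file():
        path.unlink()
        return True
    return False


# Demo on a throwaway file rather than the real download
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "nass_data.csv"
    demo.write_text("AGE\n42\n")
    print(remove_download(demo), remove_download(demo))
```

The second call returns `False` because the file is already gone, which makes the helper safe to run more than once.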