[![Open on GitHub](https://img.shields.io/badge/GitHub-View%20Source-181717?style=for-the-badge&logo=github)](https://github.com/SeenaKhosravi/NASS/blob/main/NASS_Analysis.ipynb)
[![Open In Colab](https://img.shields.io/badge/Colab-Open%20Notebook-F9AB00?style=for-the-badge&logo=google-colab)](https://colab.research.google.com/github/SeenaKhosravi/NASS/blob/main/NASS_Analysis.ipynb)
[![Open in Vertex AI](https://img.shields.io/badge/Vertex%20AI-Open%20Workbench-4285F4?style=for-the-badge&logo=google-cloud)](https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/SeenaKhosravi/NASS/main/NASS_Analysis.ipynb)


# Socioeconomic and Demographic Drivers of Ambulatory Surgery Usage
### HCUP NASS 2020 – Reproducible Pipeline (Python + R)

**Author:** Seena Khosravi, MD  
**Last Updated:** September 2, 2025  
**Data Source:** Healthcare Cost and Utilization Project (HCUP) NASS 2020

---

## Overview
This notebook provides a reproducible analysis pipeline for examining socioeconomic and demographic factors influencing ambulatory surgery usage patterns. The analysis combines Python for data processing and R for statistical modeling.

### Data Usage Agreement
**DUA Compliant** — This notebook uses a simulated dataset unless you have purchased the full dataset from HCUP. The full dataset can be loaded from your local storage or cloud storage.

### Key Features
- **Platform Agnostic:** Works in Jupyter, Colab, Vertex AI, or local environments
- **Flexible Data Sources:** GitHub (simulated), Google Cloud Storage, or local files
- **Reproducible:** All dependencies and environment setup included
- **Scalable:** Handles both the simulated dataset (1 GB, 700k rows) and the full dataset (12 GB, 7.8M rows)

---

## Design Notes

### Architecture
- **Python cells** handle "plumbing" (file I/O, environment setup, rpy2 configuration, data previews)
- **R cells** (prefixed by `%%R`) perform statistical analysis: survey weights, Census lookups, multilevel models, plots

### Data Sources
- **Default:** Simulated dataset (1 GB) from GitHub releases
- **Local:** Switch to locally stored files via configuration
- **Cloud:** Google Cloud Storage support for large datasets
- **Full Dataset:** 7.8M row HCUP release (requires DUA compliance)

### Environment Support
- Jupyter Notebook
- Google Colab
- Vertex AI Workbench
- Local Python/R environments

---

## 1. Configuration & Setup

### Data Source Configuration
Choose your data source and toggle verbose output for debugging.

In [None]:
# Configuration Settings
# Choose your data source and toggle verbose mode for debugging

USE_DRIVE = False       # True → mount Google Drive and read full HCUP files
VERBOSE_PRINTS = True   # False → suppress head()/str() previews

---
## 2. Optional Drive Mount

For Google Colab users who want to access files from Google Drive.

In [None]:
if USE_DRIVE:
    from google.colab import drive
    drive.mount('/content/drive')
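
Outside Colab the `google.colab` package cannot be imported, so enabling `USE_DRIVE` in another environment raises an `ImportError`. A minimal sketch of a guarded mount follows; the helper name `mount_drive_if_possible` is hypothetical and not part of the notebook.

```python
import importlib.util


def colab_available() -> bool:
    """Return True when the google.colab package can be located."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except (ImportError, ValueError):
        # Raised when the parent "google" package is missing or malformed.
        return False


def mount_drive_if_possible(use_drive: bool) -> bool:
    """Mount Google Drive only when requested and running inside Colab."""
    if not use_drive:
        return False
    if not colab_available():
        print("USE_DRIVE is set, but google.colab is unavailable; skipping mount.")
        return False
    from google.colab import drive
    drive.mount('/content/drive')
    return True
```

Calling `mount_drive_if_possible(USE_DRIVE)` would then replace the bare `if USE_DRIVE:` block without changing behaviour on Colab.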

---
## 3. Data Loading

### Multiple Data Source Support
This section provides a flexible data loader that supports GitHub, local files, and Google Cloud Storage.

In [None]:
# ==================== USER CONFIGURATION ====================
DATA_SOURCE = "github"  # Options: "local", "github", "gcs"

# GitHub source (default - simulated data)
GITHUB_URL = "https://github.com/SeenaKhosravi/NASS/releases/download/v1.0.0/nass_2020_simulated.csv"

# Local source (relative to environment's home directory)
LOCAL_FILENAME = "nass_2020_local.csv"

# Google Cloud Storage source
GCS_BUCKET = "nass_2020"
GCS_BLOB = "nass_2020_all.csv"
USE_EXPLICIT_AUTH = False
SERVICE_ACCOUNT_KEY_PATH = "/path/to/service-account-key.json"
# ===========================================================

import pathlib
import os
import requests
import pandas as pd

class DataLoader:
    """Flexible data loader supporting multiple sources and environments"""
    
    def __init__(self):
        self.detect_environment()
        self.setup_paths()
        
    def detect_environment(self):
        """Detect if we're in Colab, Vertex AI, or local environment"""
        self.is_colab = 'COLAB_GPU' in os.environ
        self.environment = "colab" if self.is_colab else "vertex/local"
        print(f"Environment detected: {self.environment}")
    
    def setup_paths(self):
        """Set appropriate data directory based on environment"""
        if self.is_colab:
            self.base_path = pathlib.Path('/content')
        else:
            self.base_path = pathlib.Path.home()
        
        self.data_dir = self.base_path / 'data'
        self.data_dir.mkdir(exist_ok=True)
        print(f"Data directory: {self.data_dir}")
    
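    def load_from_github(self):
        """Load simulated data from GitHub releases"""
        try:
            print(f"Downloading from: {GITHUB_URL}")
            data_path = self.data_dir / "nass_data.csv"
            # Stream to disk in chunks: the simulated file is about 1 GB, so
            # holding response.content in memory is avoided. The timeout
            # applies to connecting and to each read, not the whole transfer.
            with requests.get(GITHUB_URL, stream=True, timeout=30) as response:
                response.raise_for_status()
                with open(data_path, 'wb') as f:
                    for chunk in response.iter_content(chunk_size=1 << 20):
                        f.write(chunk)

            print("✓ Successfully downloaded from GitHub")
            return pd.read_csv(data_path)

        except Exception as e:
            # Chain the original exception so the traceback is preserved
            raise RuntimeError(f"GitHub download failed: {e}") from e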
    
    def load_from_local(self):
        """Load data from local file"""
        # Check multiple possible locations
        possible_paths = [
            self.base_path / LOCAL_FILENAME,
            self.data_dir / LOCAL_FILENAME,
            pathlib.Path(LOCAL_FILENAME)  # Current directory
        ]
        
        for local_path in possible_paths:
            if local_path.exists():
                print(f"✓ Found local file: {local_path}")
                return pd.read_csv(local_path)
        
        raise FileNotFoundError(f"Local file not found. Checked: {[str(p) for p in possible_paths]}")
    
    def load_from_gcs(self):
        """Load data from Google Cloud Storage"""
        try:
            from google.cloud import storage
            
            # Setup GCS client
            if USE_EXPLICIT_AUTH and os.path.exists(SERVICE_ACCOUNT_KEY_PATH):
                client = storage.Client.from_service_account_json(SERVICE_ACCOUNT_KEY_PATH)
            else:
                client = storage.Client()  # Use default credentials
            
            bucket = client.bucket(GCS_BUCKET)
            blob = bucket.blob(GCS_BLOB)
            
            if not blob.exists():
                raise FileNotFoundError(f"GCS blob not found: {GCS_BUCKET}/{GCS_BLOB}")
            
            data_path = self.data_dir / "nass_data.csv"
            blob.download_to_filename(data_path)
            
            print(f"✓ Successfully downloaded from GCS: {GCS_BUCKET}/{GCS_BLOB}")
            return pd.read_csv(data_path)
            
        except Exception as e:
            # Chain the original exception so the traceback is preserved
            raise RuntimeError(f"GCS loading failed: {e}") from e
    
    def load_data(self):
        """Main method to load data based on user selection"""
        print(f"Loading data from: {DATA_SOURCE.upper()}")
        
        try:
            if DATA_SOURCE.lower() == "github":
                return self.load_from_github()
            elif DATA_SOURCE.lower() == "local":
                return self.load_from_local()
            elif DATA_SOURCE.lower() == "gcs":
                return self.load_from_gcs()
            else:
                raise ValueError(f"Invalid DATA_SOURCE: {DATA_SOURCE}. Use 'github', 'local', or 'gcs'")
                
        except Exception as e:
            print(f"❌ Error: {str(e)}")
            self._print_troubleshooting()
            raise
    
    def _print_troubleshooting(self):
        """Print helpful troubleshooting messages"""
        # Normalize case so the hints match the dispatch in load_data()
        source = DATA_SOURCE.lower()
        if source == "github":
            print("\n📋 GitHub Troubleshooting:")
            print("1. Check internet connection")
            print("2. Verify URL accessibility")

        elif source == "local":
            print("\n📋 Local File Troubleshooting:")
            print(f"1. Check if file exists: {self.base_path / LOCAL_FILENAME}")
            print(f"2. Or in data directory: {self.data_dir / LOCAL_FILENAME}")

        elif source == "gcs":
            print("\n📋 GCS Troubleshooting:")
            print("1. Verify GCP authentication")
            print("2. Check bucket and file permissions")
            print("3. Ensure google-cloud-storage is installed")

# Execute data loading
try:
    loader = DataLoader()
    df = loader.load_data()
    
    print(f"\n✅ Data loaded successfully!")
    print(f"Shape: {df.shape}")
    if VERBOSE_PRINTS:
        print("\nFirst 5 rows:")
        print(df.head())
    
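except Exception as e:
    # Report the underlying error; note that df stays undefined after a failure
    print(f"\n❌ Failed to load data ({type(e).__name__}: {e}). Please check your configuration.")
    print("Consider switching DATA_SOURCE or checking your file paths.")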
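
Before `pd.read_csv` pulls a multi-gigabyte file into memory, it can help to confirm that the download looks like the expected CSV. The sketch below reads the header and counts data rows one line at a time using only the standard library; `summarize_csv` is a hypothetical helper, not something the notebook defines.

```python
import csv


def summarize_csv(path):
    """Return (header, data_row_count) without loading the file into memory."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)        # None for an empty file
        row_count = sum(1 for _ in reader)  # streams the remaining rows
    return header, row_count
```

For the simulated release the row count should be on the order of the 700k rows quoted in the overview; a far smaller number suggests a truncated or failed download.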



---
## 4. Data Verification

Verify the loaded data structure and preview first few rows.

In [None]:
# Basic data verification
print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(f"\nData types:")
print(df.dtypes)

if VERBOSE_PRINTS:
    print(f"\nMissing values:")
    print(df.isnull().sum())
    print(f"\nBasic statistics:")
    print(df.describe())
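
The R cells later in the notebook refer to specific HCUP columns by name, so a missing column only surfaces as an error several steps downstream. A small sketch of an early check follows; the column list is collected from the cells below, and `missing_columns` is a hypothetical helper.

```python
# Columns referenced by the R cells later in this notebook.
REQUIRED_COLUMNS = [
    "AGE", "DISCWT", "TOTCHG", "TOTAL_AS_ENCOUNTERS", "RACE", "FEMALE",
    "ZIPINC_QRTL", "PAY1", "CPTCCS1", "HOSP_LOCATION", "HOSP_TEACH",
    "HOSP_NASS",
]


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent, in declared order."""
    present = set(columns)
    return [name for name in required if name not in present]


# Toy example with a deliberately incomplete column list.
example_columns = ["AGE", "DISCWT", "RACE", "FEMALE"]
print(missing_columns(example_columns))
```

In the notebook itself the call would be `missing_columns(df.columns)`, and an empty list means every downstream cell can find its inputs.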

---
## 5. R Environment Setup

### Load rpy2 Extension
Enable R integration within Python notebook environment.

In [5]:
%load_ext rpy2.ipython
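
If the extension fails to load, the usual causes are a missing `rpy2` package or an R installation that cannot be found. The following sketch reports both; `r_bridge_status` is a hypothetical name. Note that rpy2 can also locate R through the `R_HOME` variable, so a missing `R` executable on `PATH` is a hint rather than proof.

```python
import importlib.util
import shutil


def r_bridge_status() -> dict:
    """Report whether rpy2 is importable and an R executable is on PATH."""
    return {
        "rpy2_installed": importlib.util.find_spec("rpy2") is not None,
        "r_on_path": shutil.which("R") is not None,
    }


for name, ok in r_bridge_status().items():
    print(f"{name}: {'yes' if ok else 'no'}")
```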

---
## 6. R Package Installation and Loading

Install required R packages and load them for the analysis.

In [None]:
%%R
req_pkgs <- c("data.table","survey","dplyr","tidyverse","tidycensus",
              "ggplot2","gridExtra","pROC","broom","lme4",
              "knitr")  # knitr supplies kable(), used for the tables below
new <- req_pkgs[!req_pkgs %in% installed.packages()[,"Package"]]
if(length(new)) install.packages(new, repos = "https://cloud.r-project.org")

# Check which packages were successfully installed and load them
installed <- installed.packages()[,"Package"]
loaded_pkgs <- c()
failed_pkgs <- c()
for (pkg in req_pkgs) {
  if (pkg %in% installed) {
    library(pkg, character.only = TRUE)
    loaded_pkgs <- c(loaded_pkgs, pkg)
  } else {
    failed_pkgs <- c(failed_pkgs, pkg)
  }
}

if (length(failed_pkgs) > 0) {
  cat("Warning: The following packages failed to install and were not loaded:", paste(failed_pkgs, collapse = ", "), "\n")
}

---
## 7. R Data Processing

Convert the pandas DataFrame passed in via `-i df` to a data.table and apply light type coercion.

In [None]:
%%R -i df -i VERBOSE_PRINTS
options(datatable.print.nrows = 10)

NASS <- as.data.table(df)
if (VERBOSE_PRINTS) print(NASS[1:10])

# Light type coercion
num_cols  <- c("AGE","DISCWT","TOTCHG","TOTAL_AS_ENCOUNTERS")
NASS[, (num_cols) := lapply(.SD, as.numeric), .SDcols = num_cols]

# Boolean helper
NASS[, WHITE := fifelse(RACE == 1, 1, 0)]

---
## 8. R Analysis - Headline Counts

Replicate headline counts and verify data structure.

In [None]:
%%R
cat("Rows:", nrow(NASS), "  Cols:", ncol(NASS), "\n")
top10 <- NASS[, .N, by = CPTCCS1][order(-N)][1:10]
knitr::kable(top10, caption = "Top 10 CPTCCS1 counts (simulated)")
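
For readers more at home in Python, the data.table expression above (count by `CPTCCS1`, sort descending, keep ten) corresponds to a frequency count. A sketch with made-up codes follows; `top_n_codes` is a hypothetical helper.

```python
from collections import Counter


def top_n_codes(codes, n=10):
    """Return the n most frequent codes as (code, count) pairs."""
    return Counter(codes).most_common(n)


toy_codes = ["A", "B", "A", "C", "A", "B"]  # made-up codes, not real CPTCCS1 values
print(top_n_codes(toy_codes, n=2))  # [('A', 3), ('B', 2)]
```

One difference is worth noting: `[1:10]` in data.table pads the result with `NA` rows when fewer than ten distinct codes exist, whereas `most_common` simply returns fewer pairs.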

---
## 9. R Analysis - Income Quartile vs Procedure

Visualize income distribution within the most common procedures.

In [None]:
%%R
top_codes <- top10$CPTCCS1
# Treat both codes as categorical so bars are ordered and filled by level
plt_income <- NASS[CPTCCS1 %in% top_codes] |>
  ggplot(aes(x = fct_infreq(as.character(CPTCCS1)), fill = factor(ZIPINC_QRTL))) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  coord_flip() +
  labs(y = "Share within CPT", x = "CPTCCS1", fill = "ZIP Quartile",
       title = "Income distribution within 10 most-common procedures")
print(plt_income)

---
## 10. Census API Setup

Set up environment variable for Census API key.

In [None]:
import getpass
import os
os.environ["CENSUS_API_KEY"] = getpass.getpass("Enter your Census API key (will not echo):")
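
Re-running this cell prompts for the key every time. A sketch of a helper that reuses a key already present in the environment and rejects an empty entry is shown below; `ensure_census_key` is a hypothetical name, and the `env` and `prompt` parameters exist only to make the function easy to exercise.

```python
import getpass
import os


def ensure_census_key(env=os.environ, prompt=getpass.getpass) -> str:
    """Return the Census API key, prompting only when it is not already set."""
    key = env.get("CENSUS_API_KEY", "").strip()
    if not key:
        key = prompt("Enter your Census API key (will not echo):").strip()
        if not key:
            raise ValueError("A Census API key is required for the tidycensus calls.")
        env["CENSUS_API_KEY"] = key
    return key
```

Calling `ensure_census_key()` with no arguments reads and writes the real process environment, which is what the R cell picks up through `Sys.getenv`.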

---
## 11. R | Set Census Key & Pull 2020 DHC Totals

In [None]:
%%R -i VERBOSE_PRINTS
# Register the key for this R session only; install = FALSE leaves .Renviron untouched
tidycensus::census_api_key(Sys.getenv("CENSUS_API_KEY"), overwrite = FALSE, install = FALSE)

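# P12 cell 001 is the grand total and cells 002 and 026 are the male and female
# subtotals; request only the 46 sex-by-age cells so that summing them yields
# the population once rather than three times.
get_vars <- function(base) sprintf("%s_%03dN", base, c(3:25, 27:49))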

vars_total <- get_vars("P12")
vars_white <- get_vars("P12I")

pull_state_totals <- function(vars){
  get_decennial(geography = "state",
                variables = vars,
                year = 2020, sumfile = "dhc") |>
  group_by(NAME) |> summarise(total = sum(value))
}

total_pop  <- pull_state_totals(vars_total)
white_pop  <- pull_state_totals(vars_white)

census_prop <- merge(total_pop, white_pop, by = "NAME",
                     suffixes = c("_all","_white"))
# merge() on tibbles returns a data.frame; convert so that := works
setDT(census_prop)
census_prop[, prop_white := total_white / total_all]

if (VERBOSE_PRINTS) print(head(census_prop))
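
The variable names requested from the 2020 DHC follow the pattern `<table>_<cell>N`. Assuming the standard P12 layout (cell 001 is the grand total, 002 and 026 are the male and female subtotals, and the remaining 46 cells are sex-by-age groups), summing all 49 cells counts every person three times. The ratio of white to total population is unaffected, but the totals themselves are inflated. The sketch below rebuilds the names in Python and checks the arithmetic on made-up counts; `dhc_vars` is a hypothetical helper.

```python
def dhc_vars(base, cells):
    """Build DHC variable names such as 'P12_001N' for the given cell numbers."""
    return [f"{base}_{i:03d}N" for i in cells]


all_cells = range(1, 50)
leaf_cells = [i for i in all_cells if i not in (1, 2, 26)]

print(dhc_vars("P12", [1, 2, 26]))  # ['P12_001N', 'P12_002N', 'P12_026N']
print(len(leaf_cells))              # 46

# Made-up state: 23 male age cells of 10 people, 23 female age cells of 12.
male_cells = [10] * 23
female_cells = [12] * 23
population = sum(male_cells) + sum(female_cells)

# Summing all 49 cells adds the grand total and both subtotals to the leaves.
sum_of_all_cells = (population + sum(male_cells) + sum(female_cells)
                    + sum(male_cells) + sum(female_cells))
print(population, sum_of_all_cells)  # 506 1518
```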

---
## 12. R | Weighted vs Unweighted Proportion Test

In [None]:
%%R
library(survey)

# Survey design using provided discharge weight
des <- svydesign(ids = ~1, weights = ~DISCWT, data = NASS)

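# WHITE is NA where RACE is missing, so drop those rows from both estimates
unweighted_hat <- mean(NASS$WHITE, na.rm = TRUE)
weighted_hat   <- coef(svymean(~WHITE, des, na.rm = TRUE))[[1]]

us_prop <- weighted.mean(census_prop$prop_white,
                         w = census_prop$total_all)

# Values are proportions on a 0-1 scale, not percentages
cat(sprintf("Unweighted NASS white proportion: %.3f\n", unweighted_hat))
cat(sprintf("Weighted   NASS white proportion: %.3f\n", weighted_hat))
cat(sprintf("2020 Census white proportion (all states pulled): %.3f\n", us_prop))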

svytest <- svyciprop(~WHITE, des,
                     method = "likelihood", level = 0.95)
print(svytest)
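
The point estimate that `svymean` reports for a 0/1 indicator is a weighted mean: the sum of weight times indicator, divided by the sum of weights. A toy illustration of how discharge weights can move the estimate follows; the numbers are made up and `weighted_proportion` is a hypothetical helper.

```python
def weighted_proportion(flags, weights):
    """Weighted mean of 0/1 flags: sum(w * x) / sum(w)."""
    total_weight = sum(weights)
    return sum(f * w for f, w in zip(flags, weights)) / total_weight


white = [1, 1, 0, 0]            # made-up indicator values
discwt = [1.0, 1.0, 3.0, 5.0]   # made-up discharge weights

unweighted = sum(white) / len(white)
weighted = weighted_proportion(white, discwt)
print(unweighted, weighted)  # 0.5 0.2
```

Here the two records with large weights are both non-white, so the weighted proportion falls well below the crude one. This covers only the point estimate; the standard errors and confidence interval from the `survey` package depend on the declared design.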

---
## 13. R | Age-by-sex plot vs Census (adapted from `agesociodiv.r`)

In [None]:
%%R
age_breaks <- c(-Inf,4,9,14,17,19,20,21,24,29,34,39,44,49,54,59,61,64,
                66,69,74,79,84,Inf)
age_labels <- c("U5","5-9","10-14","15-17","18-19","20","21",
                "22-24","25-29","30-34","35-39","40-44","45-49",
                "50-54","55-59","60-61","62-64","65-66","67-69",
                "70-74","75-79","80-84","85+")

NASS[, AGE_GROUP := cut(AGE, breaks = age_breaks,
                        labels = age_labels, right = TRUE)]

plot_df <- NASS[, .(white = sum(WHITE),
                    n     = .N),
                by = .(SEX = factor(FEMALE, labels=c("Male","Female")),
                       AGE_GROUP)]
plot_df[, prop := white/n]

gg_gender <- ggplot(plot_df, aes(x = AGE_GROUP, y = prop,
                                 group = SEX, color = SEX)) +
  geom_line(linewidth=1) +
  geom_point() +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "% White (NASS, simulated)", x = "Age-group",
       title = "Crude white proportion by age & sex") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle=45, hjust=1))
print(gg_gender)
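
The `cut(..., right = TRUE)` call assigns each age to an interval that is closed on the right, so age 4 falls in "U5" and age 5 in "5-9". The same mapping can be sketched in Python with a binary search over the finite upper bounds taken from `age_breaks`; `age_group` is a hypothetical helper.

```python
from bisect import bisect_left

# Upper bounds of every group except the open-ended last one (from age_breaks).
UPPER_BOUNDS = [4, 9, 14, 17, 19, 20, 21, 24, 29, 34, 39, 44, 49, 54, 59,
                61, 64, 66, 69, 74, 79, 84]
AGE_LABELS = ["U5", "5-9", "10-14", "15-17", "18-19", "20", "21",
              "22-24", "25-29", "30-34", "35-39", "40-44", "45-49",
              "50-54", "55-59", "60-61", "62-64", "65-66", "67-69",
              "70-74", "75-79", "80-84", "85+"]


def age_group(age):
    """Map an age to its label using right-closed intervals, like cut(right=TRUE)."""
    return AGE_LABELS[bisect_left(UPPER_BOUNDS, age)]


for age in (4, 5, 20, 21, 22, 85):
    print(age, age_group(age))
```

The 23 labels line up with the 23 age groups per sex in the Census P12 table, which is what makes a later comparison against Census counts possible.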

---
## 14. R | Multilevel logistic models (hospital nested, 3 tiers)

In [None]:
%%R
features <- NASS[, .(WHITE,
                     FEMALE,
                     ZIPINC_QRTL,
                     PAY1,
                     CPTCCS1,
                     HOSP_LOCATION,
                     HOSP_TEACH,
                     HOSP_NASS)]

features[, c(names(features)) := lapply(.SD, as.factor)]

formulas <- list(
  m1 = WHITE ~ FEMALE + (1|HOSP_NASS),
  m2 = WHITE ~ FEMALE + ZIPINC_QRTL + (1|HOSP_NASS),
  m3 = WHITE ~ FEMALE + ZIPINC_QRTL + PAY1 + CPTCCS1 +
                    HOSP_LOCATION + HOSP_TEACH + (1|HOSP_NASS)
)

fit <- lapply(formulas, glmer, family = binomial, data = features,
              control = glmerControl(optimizer="bobyqa", optCtrl=list(maxfun=2e4)))

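# broom::tidy() has no method for glmer fits (those tidiers live in broom.mixed),
# so read the fixed-effect table from summary(); m1 has only two fixed effects,
# hence head() rather than indexing rows 1:5.
lapply(fit, function(m) head(coef(summary(m)), 5))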
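
Each of the three formulas adds fixed effects to the same random-intercept structure. As a sketch in standard mixed-model notation (the symbols are illustrative and do not appear in the notebook), for encounter $i$ in hospital $j$:

$$
\operatorname{logit}\Pr(\text{WHITE}_{ij} = 1) = \beta_0 + \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_j,
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2)
$$

Here $\mathbf{x}_{ij}$ holds the fixed-effect covariates of the model in question (only `FEMALE` in `m1`), and $u_j$ is the hospital-level intercept introduced by `(1|HOSP_NASS)`.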

---
## 15. R | Compare AUC across the three models

In [None]:
%%R
library(pROC)
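auc_vals <- sapply(fit, function(m){
  preds <- predict(m, type="response")
  # glmer drops rows with missing values, so take the response from the
  # model frame to keep it aligned with the predictions
  as.numeric(roc(model.frame(m)$WHITE, preds, quiet = TRUE)$auc)
})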
knitr::kable(data.frame(model = names(auc_vals), AUC = auc_vals),
             caption = "AUC (in-sample, simulated data)")
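
The AUC reported by `pROC` can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. A brute-force sketch on made-up data follows; `auc_pairwise` is a hypothetical helper, and its quadratic cost makes it suitable only for illustration.

```python
def auc_pairwise(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present to compute an AUC.")
    # A correctly ordered pair scores 1, a tie scores 0.5, otherwise 0.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


toy_labels = [0, 0, 1, 1]             # made-up outcomes
toy_scores = [0.1, 0.4, 0.35, 0.8]    # made-up predicted probabilities
print(auc_pairwise(toy_labels, toy_scores))  # 0.75
```

Three of the four positive-negative pairs are ordered correctly, giving 0.75. Because the notebook's values are computed in-sample, they will tend to overstate how well the models discriminate on new data.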

---
## 16. Python | Teardown Helper (Optional)

In [None]:
if not USE_DRIVE:
    print("Done ✅ On hosted runtimes such as Colab, the downloaded CSV is removed when the session ends; local copies in the data directory persist.")