# 03 – BLS Nursing Workforce EDA

## Overview
This notebook prepares Bureau of Labor Statistics (BLS) Occupational Employment and Wage Statistics
(OEWS) data for use in analyzing how regional nursing workforce conditions relate to hospital staffing
capacity and quality outcomes.

The goal is to construct clean, comparable regional nurse labor market indicators that can be merged
with hospital-level CMS data using geographic identifiers.

## Data Source & Scope
**Source:** U.S. Bureau of Labor Statistics – Occupational Employment and Wage Statistics (OEWS)  
**Occupation:** Registered Nurses (SOC 29-1141)  
**Unit of analysis:** Geographic labor market (one row per MSA or state)  
**Geographic levels:** Metropolitan Statistical Area (MSA) and State  
**Time structure:** 3-year rolling averages (lagged relative to CMS data)

This dataset provides regional measures of nursing workforce supply and compensation, including
employment counts, employment concentration, and wage levels. These variables serve as contextual
labor market indicators rather than hospital-specific staffing measures.

## 1. Load and Initial Cleaning
This section loads raw BLS OEWS nursing workforce data and performs initial cleaning steps,
including column standardization, occupation filtering, and basic validation of geographic identifiers.

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 160)

# --- Project paths (01-style, robust) ---
ROOT = Path.cwd().resolve()

# If you're already inside /work/src/notebooks, climb to /work
if ROOT.name == "notebooks" and ROOT.parent.name == "src":
    ROOT = ROOT.parents[1]          # /work
elif ROOT.name == "src":
    ROOT = ROOT.parent              # /work

SRC = ROOT / "src"
RAW_DIR = SRC / "data" / "raw"
CLEAN_DIR = SRC / "data" / "clean"

RAW_DIR.mkdir(parents=True, exist_ok=True)
CLEAN_DIR.mkdir(parents=True, exist_ok=True)

MSA_RAW_PATH = RAW_DIR / "BLS_MSANursingWorkforceData.csv"
STATE_RAW_PATH = RAW_DIR / "BLS_StateNursingWorkforceData.csv"

MSA_CLEAN_PATH = CLEAN_DIR / "bls_msa_rn_clean.csv"
STATE_CLEAN_PATH = CLEAN_DIR / "bls_state_rn_clean.csv"

print("CWD:", Path.cwd())
print("Resolved ROOT:", ROOT)
print("MSA raw:", MSA_RAW_PATH)
print("State raw:", STATE_RAW_PATH)
print("MSA clean:", MSA_CLEAN_PATH)
print("State clean:", STATE_CLEAN_PATH)

assert MSA_RAW_PATH.exists(), f"Missing file: {MSA_RAW_PATH}"
assert STATE_RAW_PATH.exists(), f"Missing file: {STATE_RAW_PATH}"


def bls_to_numeric(series: pd.Series) -> pd.Series:
    """Convert BLS fields with suppression markers (#,*,—) and commas to numeric."""
    if pd.api.types.is_numeric_dtype(series):
        return series

    s = series.astype("string").str.strip()
    s = s.replace(["#", "*", "**", "—", ""], pd.NA)
    s = s.str.replace(",", "", regex=False)

    return pd.to_numeric(s, errors="coerce")

CWD: /voc/work/src/notebooks
Resolved ROOT: /voc/work
MSA raw: /voc/work/src/data/raw/BLS_MSANursingWorkforceData.csv
State raw: /voc/work/src/data/raw/BLS_StateNursingWorkforceData.csv
MSA clean: /voc/work/src/data/clean/bls_msa_rn_clean.csv
State clean: /voc/work/src/data/clean/bls_state_rn_clean.csv
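
As a quick illustration of how `bls_to_numeric` treats suppression markers and thousands separators, the sketch below repeats the helper so it runs on its own and applies it to a handful of made-up values. The `sample` series is hypothetical and not drawn from the BLS files.

```python
import pandas as pd


def bls_to_numeric(series: pd.Series) -> pd.Series:
    """Same logic as the notebook helper, repeated so this cell is self-contained."""
    if pd.api.types.is_numeric_dtype(series):
        return series

    s = series.astype("string").str.strip()
    s = s.replace(["#", "*", "**", "—", ""], pd.NA)
    s = s.str.replace(",", "", regex=False)

    return pd.to_numeric(s, errors="coerce")


# Hypothetical raw values (illustrative only)
sample = pd.Series(["1,234", " 56.7 ", "*", "#", "**", None])
converted = bls_to_numeric(sample)

print(converted)
print("Missing after conversion:", int(converted.isna().sum()))
```

Suppression markers and empty cells become missing values, while comma-formatted numbers are parsed normally.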


In [2]:
msa = pd.read_csv(MSA_RAW_PATH, low_memory=False)
msa.columns = msa.columns.str.strip()
print("MSA shape:", msa.shape)

state = pd.read_csv(STATE_RAW_PATH, low_memory=False)
state.columns = state.columns.str.strip()
print("State shape:", state.shape)

MSA shape: (391, 27)
State shape: (54, 27)
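
One way to carry out the basic validation mentioned at the start of this section is to confirm that the expected columns are present and that each area appears only once. The sketch below is a minimal example under stated assumptions: the `toy` frame, the `REQUIRED` list, and the helper names `check_columns` and `duplicated_areas` are all hypothetical and are not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature of a loaded extract (codes, titles, and values are made up)
toy = pd.DataFrame({
    "AREA": [11111, 22222, 22222],
    "AREA_TITLE": ["Area A", "Area B", "Area B"],
    "OCC_CODE": ["29-1141", "29-1141", "29-1141"],
    "TOT_EMP": ["1,000", "2,000", "2,000"],
})

# Assumed list of columns that later cells rely on
REQUIRED = ["AREA", "AREA_TITLE", "OCC_CODE", "TOT_EMP", "A_MEAN"]


def check_columns(df: pd.DataFrame, required: list) -> list:
    """Return the required columns that are absent from df."""
    return [c for c in required if c not in df.columns]


def duplicated_areas(df: pd.DataFrame, key: str = "AREA") -> list:
    """Return the area codes that appear on more than one row."""
    counts = df[key].value_counts()
    return sorted(counts[counts > 1].index.tolist())


missing_cols = check_columns(toy, REQUIRED)
dupes = duplicated_areas(toy)

print("Missing columns:", missing_cols)
print("Duplicated AREA codes:", dupes)
```

Running the same two checks on `msa` and `state` right after loading would surface schema changes or repeated areas before any filtering takes place.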


## 2. Filter to Registered Nurse Occupation
The analysis is restricted to Registered Nurses (SOC code 29-1141), which aligns with the nursing
staffing focus of the project. All other occupations are excluded.

In [3]:
RN_CODE = "29-1141"

msa["OCC_CODE"] = msa["OCC_CODE"].astype(str).str.strip()
msa_rn = msa[msa["OCC_CODE"] == RN_CODE].copy()
print("MSA RN shape:", msa_rn.shape)

state["OCC_CODE"] = state["OCC_CODE"].astype(str).str.strip()
state_rn = state[state["OCC_CODE"] == RN_CODE].copy()
print("State RN shape:", state_rn.shape)


MSA RN shape: (391, 27)
State RN shape: (54, 27)


## 3. Numeric Standardization and Suppression Handling
BLS data include suppression symbols and formatting artifacts (e.g., '#', '*', commas) that must
be cleaned before analysis. This section converts key employment and wage variables to numeric
form and documents the extent of missing or suppressed data.

In [4]:
num_cols = ["TOT_EMP", "A_MEAN", "JOBS_1000", "H_MEAN", "EMP_PRSE", "LOC_QUOTIENT", "MEAN_PRSE", "AREA_TYPE"]

# Apply the same conversion to both frames; columns absent from a frame are skipped
for df in (msa_rn, state_rn):
    for c in num_cols:
        if c in df.columns:
            df[c] = bls_to_numeric(df[c])

# isna().mean() returns the share of missing rows on a 0-1 scale, not a percentage
print("MSA RN missing share (0-1):")
display(msa_rn[[c for c in ["TOT_EMP", "A_MEAN", "JOBS_1000"] if c in msa_rn.columns]].isna().mean())

print("State RN missing share (0-1):")
display(state_rn[[c for c in ["TOT_EMP", "A_MEAN", "JOBS_1000"] if c in state_rn.columns]].isna().mean())

print("MSA AREA_TYPE counts:")
if "AREA_TYPE" in msa_rn.columns:
    display(msa_rn["AREA_TYPE"].value_counts(dropna=False).sort_index())

print("State AREA_TYPE counts:")
if "AREA_TYPE" in state_rn.columns:
    display(state_rn["AREA_TYPE"].value_counts(dropna=False).sort_index())

MSA RN missing share (0-1):


TOT_EMP      0.000000
A_MEAN       0.012788
JOBS_1000    0.000000
dtype: float64

State RN missing share (0-1):


TOT_EMP      0.000000
A_MEAN       0.018519
JOBS_1000    0.000000
dtype: float64

MSA AREA_TYPE counts:


AREA_TYPE
4    391
Name: count, dtype: int64

State AREA_TYPE counts:


AREA_TYPE
2    51
3     3
Name: count, dtype: int64
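
The missing-value summaries above come from `isna().mean()`, which reports a share between 0 and 1. The value 0.012788 for MSA `A_MEAN` is therefore about 1.28% (consistent with 5 of 391 rows), and 0.018519 for the state file is about 1.85% (1 of 54 rows). The sketch below shows the conversion to percentages on a hypothetical frame; `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 200 areas, one of which has a suppressed mean wage
toy = pd.DataFrame({
    "TOT_EMP": np.arange(200, dtype=float),
    "A_MEAN": [np.nan] + [80000.0] * 199,
})

missing_fraction = toy.isna().mean()                  # share of rows missing, 0-1 scale
missing_percent = (missing_fraction * 100).round(2)   # same information as a percentage

print(missing_percent)
```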

## 4. MSA-Level Nursing Workforce Measures
This section prepares Metropolitan Statistical Area (MSA)-level nursing workforce indicators,
including employment counts, nurses per 1,000 jobs, and mean wages. These measures provide
fine-grained regional labor market context where data are available.

In [6]:
msa_keep = [c for c in [
    "AREA", "AREA_TITLE", "AREA_TYPE", "PRIM_STATE",
    "OCC_CODE", "TOT_EMP", "JOBS_1000", "A_MEAN", "H_MEAN",
    "EMP_PRSE", "LOC_QUOTIENT"
] if c in msa_rn.columns]

bls_msa_clean = msa_rn[msa_keep].copy()

# Standardize join keys
bls_msa_clean["AREA"] = bls_msa_clean["AREA"].astype(str).str.strip()
if "PRIM_STATE" in bls_msa_clean.columns:
    bls_msa_clean["PRIM_STATE"] = bls_msa_clean["PRIM_STATE"].astype(str).str.strip()

bls_msa_clean.to_csv(MSA_CLEAN_PATH, index=False)
print("Saved MSA clean:", MSA_CLEAN_PATH, bls_msa_clean.shape)


Saved MSA clean: /voc/work/src/data/clean/bls_msa_rn_clean.csv (391, 11)
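
For readers less familiar with the concentration measures kept here, the sketch below shows how jobs per 1,000 and the location quotient relate to each other. A location quotient compares an area's occupational employment share with the national share, so it equals the ratio of the area's jobs-per-1,000 figure to the national one. The function names and all input numbers are hypothetical; BLS publishes `JOBS_1000` and `LOC_QUOTIENT` directly, and this is only meant to aid interpretation.

```python
def jobs_per_1000(occ_emp: float, total_emp: float) -> float:
    """Occupational employment per 1,000 jobs in an area."""
    return 1000 * occ_emp / total_emp


def location_quotient(area_occ_emp: float, area_total_emp: float,
                      nat_occ_emp: float, nat_total_emp: float) -> float:
    """Area employment share of the occupation divided by the national share."""
    area_share = area_occ_emp / area_total_emp
    nat_share = nat_occ_emp / nat_total_emp
    return area_share / nat_share


# Hypothetical counts (not BLS figures)
area_rn, area_all = 3_000, 100_000
nat_rn, nat_all = 2_000_000, 100_000_000

area_j1000 = jobs_per_1000(area_rn, area_all)
nat_j1000 = jobs_per_1000(nat_rn, nat_all)
lq = location_quotient(area_rn, area_all, nat_rn, nat_all)

print("Area jobs per 1,000:", area_j1000)
print("National jobs per 1,000:", nat_j1000)
print("Location quotient:", lq)
print("Ratio of jobs per 1,000:", area_j1000 / nat_j1000)
```

A location quotient above 1 indicates that RNs make up a larger share of local employment than they do nationally.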


## 5. State-Level Nursing Workforce Measures
MSA-level BLS estimates can be suppressed or unavailable, especially in smaller or rural regions,
so state-level nursing workforce measures are prepared as a fallback. These cover nearly all
hospitals, albeit at a coarser geographic resolution.


In [7]:
state_keep = [c for c in [
    "AREA", "AREA_TITLE", "AREA_TYPE", "PRIM_STATE",
    "OCC_CODE", "TOT_EMP", "JOBS_1000", "A_MEAN", "H_MEAN",
    "EMP_PRSE", "LOC_QUOTIENT"
] if c in state_rn.columns]

bls_state_clean = state_rn[state_keep].copy()

# Standardize join keys
bls_state_clean["AREA"] = bls_state_clean["AREA"].astype(str).str.strip()
if "PRIM_STATE" in bls_state_clean.columns:
    bls_state_clean["PRIM_STATE"] = bls_state_clean["PRIM_STATE"].astype(str).str.strip()

bls_state_clean.to_csv(STATE_CLEAN_PATH, index=False)
print("Saved State clean:", STATE_CLEAN_PATH, bls_state_clean.shape)


Saved State clean: /voc/work/src/data/clean/bls_state_rn_clean.csv (54, 11)
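
Casting `AREA` with `astype(str)` does not restore leading zeros if pandas already parsed the column as integers when the raw file was read. Whether that matters depends on how the CMS side stores its geographic keys. The sketch below is one possible way to pad codes to a fixed width, assuming 2-digit state codes and 5-digit MSA codes; the helper `pad_code` and the sample codes are hypothetical.

```python
import pandas as pd


def pad_code(series: pd.Series, width: int) -> pd.Series:
    """Cast codes to strings and left-pad with zeros to a fixed width."""
    return series.astype("string").str.strip().str.zfill(width)


# Hypothetical codes as they might look after being parsed as integers
state_codes = pd.Series([1, 6, 48])
msa_codes = pd.Series([10180, 35620])

state_keys = pad_code(state_codes, 2)   # assumed 2-digit state codes
msa_keys = pad_code(msa_codes, 5)       # assumed 5-digit MSA codes

print(state_keys.tolist())
print(msa_keys.tolist())
```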


## 6. Save Cleaned Outputs for Merging
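The cleaned MSA- and state-level nursing workforce datasets were written to disk in Sections 4
and 5 for downstream merging with CMS hospital cost report and quality rating data. Both
geographic resolutions are retained to support best-available matching strategies. The cell
below reloads both files to confirm that they were saved with the expected shapes.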


In [8]:
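# Read the join keys back as strings so they are not re-parsed as integers
key_dtypes = {"AREA": str, "PRIM_STATE": str, "OCC_CODE": str}

bls_msa = pd.read_csv(MSA_CLEAN_PATH, dtype=key_dtypes, low_memory=False)
bls_state = pd.read_csv(STATE_CLEAN_PATH, dtype=key_dtypes, low_memory=False)

bls_msa.shape, bls_state.shape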

((391, 11), (54, 11))
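
A best-available matching strategy could look roughly like the sketch below: join hospitals to the MSA table first, then fall back to the state value wherever the MSA value is missing. This is a minimal sketch, not the project's actual merge. The frames `hospitals`, `toy_msa`, and `toy_state`, the hospital-side column names `hospital_id`, `msa_code`, and `state`, and every value are hypothetical.

```python
import pandas as pd

# Hypothetical hospital table; H3 lies outside any MSA
hospitals = pd.DataFrame({
    "hospital_id": ["H1", "H2", "H3"],
    "msa_code": ["11111", "22222", None],
    "state": ["AA", "BB", "BB"],
})

# Hypothetical BLS extracts; the wage for MSA 22222 is suppressed
toy_msa = pd.DataFrame({"AREA": ["11111", "22222"], "A_MEAN": [90000.0, None]})
toy_state = pd.DataFrame({"PRIM_STATE": ["AA", "BB"], "A_MEAN": [85000.0, 70000.0]})

merged = (
    hospitals
    .merge(
        toy_msa.rename(columns={"AREA": "msa_code", "A_MEAN": "a_mean_msa"}),
        on="msa_code", how="left",
    )
    .merge(
        toy_state.rename(columns={"PRIM_STATE": "state", "A_MEAN": "a_mean_state"}),
        on="state", how="left",
    )
)

# Prefer the MSA wage; use the state wage only where the MSA wage is missing
merged["rn_wage"] = merged["a_mean_msa"].fillna(merged["a_mean_state"])
merged["rn_wage_source"] = merged["a_mean_msa"].notna().map({True: "msa", False: "state"})

print(merged[["hospital_id", "rn_wage", "rn_wage_source"]])
```

Recording the source level alongside the value keeps it possible to check later whether results differ between MSA-matched and state-matched hospitals.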