# Simulated Outpatient Claims Analysis by HCPCS and Revenue Code

## Overview

This notebook simulates outpatient claim-level data in which each claim carries 1 to 5 service lines. Each line has a HCPCS code, a revenue center code, a charge amount, a unit count, and a service date. The wide claim-level data is then reshaped to long format for line-level analysis.

We then focus on two commonly billed HCPCS codes (G0463 and J0696) and summarize their lines by HCPCS and revenue center code. The summary includes claim count, line count, unique provider and beneficiary counts, total units, total charge, min/max/mean charge, and the charge deciles from p10 to p90. The result is sorted by HCPCS and descending total charge, then saved as an Excel file for further use.

## Simulate OPPS Claims

The goal is to simulate a realistic outpatient (OPPS) claim-level dataset, where:

- Each **row** represents **one claim**
- Each **claim** contains **multiple embedded service lines**
- Each **service line** includes:
  - A **HCPCS** code *(what service was provided)*
  - A **Revenue Center** code *(billing department)*
  - A **line-level charge amount**
  - A **unit count**
  - A **revenue (service) date**

This structure mirrors CMS claims files like:

- `OUTPATIENT_BASE` *(claim-level)*
- `OUTPATIENT_REVENUE` *(line-level)*
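
As a small illustration of that two-file layout, the sketch below builds a claim-level table and a line-level table linked by `claim_id`. The table names `base` and `revenue`, the columns, and the values are made up for illustration; they are not the actual CMS file layout.

```python
import pandas as pd

# Hypothetical claim-level table: one row per claim.
base = pd.DataFrame({
    "claim_id": [1, 2],
    "bene_id": [2001, 2002],
    "prov_id": [300, 301],
})

# Hypothetical line-level table: one row per service line.
revenue = pd.DataFrame({
    "claim_id": [1, 1, 2],
    "line_no": [1, 2, 1],
    "hcpcs": ["G0463", "J0696", "G0463"],
    "rvcd": ["0510", "0636", "0510"],
    "chrg": [150.00, 820.50, 175.25],
})

# Attach the claim-level fields to every line by joining on claim_id.
# validate="many_to_one" raises if claim_id is duplicated in `base`.
lines = revenue.merge(base, on="claim_id", how="left", validate="many_to_one")
print(lines)
```

The simulation in this notebook stores the same information in a single wide table instead, with one set of `lineN_*` columns per service line.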

In [32]:
import pandas as pd
import numpy as np
import random
from datetime import datetime, timedelta

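# Seed both generators: NumPy drives the claim fields, `random` drives the dates
np.random.seed(42)
random.seed(42)

# Reference pools of HCPCS and revenue center codes to sample from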
hcpcs_pool = [
    "     ", "G0463", "J0696", "C8900", "A0428", "Q2035",
    "J2778", "J1100", "J1885", "J3489", "Q4101",
    "J0885", "J1569", "J0881", "J2505", "C9399",
    "C1713", "J1459", "Q9981", "Q5101", "J9206"
]
rev_center_pool = [
    "0450", "0456", "0510", "0360", "0300",
    "0270", "0320", "0250", "0636", "0260",
    "0420", "0430", "0440", "0460", "0480",
    "0490", "0520", "0730", "0610", "0910"
]

In [33]:
# Random revenue date generator
def random_date(start_date, end_date):
    delta = end_date - start_date
    return start_date + timedelta(days=random.randint(0, delta.days))

start = datetime(2025, 1, 1)
end = datetime(2025, 2, 1)
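
As an alternative sketch, the dates can be drawn from a NumPy generator so that a single seed controls the whole simulation. The function name `random_dates`, the seed, and the use of `np.random.default_rng` are assumptions made for this example, not part of the notebook's code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # assumed seed, chosen only for illustration

def random_dates(start, end, size, rng):
    """Draw `size` dates uniformly from [start, end], both ends inclusive."""
    start = pd.Timestamp(start)
    n_days = (pd.Timestamp(end) - start).days
    # rng.integers excludes the upper bound, so add 1 to keep `end` reachable
    offsets = rng.integers(0, n_days + 1, size=size)
    return start + pd.to_timedelta(offsets, unit="D")

dates = random_dates("2025-01-01", "2025-02-01", size=5, rng=rng)
print(dates.strftime("%Y-%m-%d").tolist())
```

Like `random.randint` in the function above, this version treats the end date as inclusive, so 2025-02-01 can appear in the output.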

In [35]:
# Generate a single claim
def generate_claim(claim_id):
    claim = {
        "claim_id": claim_id,
        "bene_id": np.random.randint(2000, 2200),
        "prov_id": np.random.randint(300, 305)
    }

    n_lines = np.random.randint(1, 6)  # upper bound is exclusive: 1 to 5 lines per claim
    claim["line_ct"] = n_lines

    # Wide layout: each service line adds its own lineN_* keys to the claim record
    for i in range(n_lines):
        claim[f"line{i+1}_hcpcs"] = np.random.choice(hcpcs_pool)
        claim[f"line{i+1}_rev"] = np.random.choice(rev_center_pool)
        claim[f"line{i+1}_chrg"] = round(np.random.uniform(50, 1500), 2)
        claim[f"line{i+1}_units"] = np.random.randint(1, 10)  # 1 to 9 units
        claim[f"line{i+1}_revdt"] = random_date(start, end).strftime("%Y-%m-%d")

    return claim


In [40]:
# Build the claim-level DataFrame (claim_id runs from 0 to n_claims - 1)
n_claims = 1000
claims_data = [generate_claim(i) for i in range(n_claims)]

df_claims = pd.DataFrame(claims_data)
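
A quick consistency check on the wide layout is to confirm that the number of populated `lineN_hcpcs` columns matches `line_ct` on every claim. The sketch below runs on a tiny hand-made frame named `wide` (a stand-in for `df_claims`, with made-up values) so that it is self-contained.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wide claim table; values are illustrative only.
wide = pd.DataFrame({
    "claim_id": [0, 1],
    "line_ct": [1, 2],
    "line1_hcpcs": ["G0463", "J0696"],
    "line2_hcpcs": [np.nan, "G0463"],
})

# Count the non-missing HCPCS slots per claim and compare with line_ct.
hcpcs_cols = [c for c in wide.columns if c.endswith("_hcpcs")]
filled = wide[hcpcs_cols].notna().sum(axis=1)
mismatch = wide[filled != wide["line_ct"]]
print(f"{len(mismatch)} claims with inconsistent line counts")
```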

In [41]:
df_claims.head(5)

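| | claim_id | bene_id | prov_id | line_ct | line1_hcpcs | line1_rev | line1_chrg | line1_units | line1_revdt | line2_hcpcs | ... | line4_hcpcs | line4_rev | line4_chrg | line4_units | line4_revdt | line5_hcpcs | line5_rev | line5_chrg | line5_units | line5_revdt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2113 | 301 | 3 | J3489 | 456 | 517.7 | 8 | 2025-01-21 | | ... | | | | | | | | | | |
| 1 | 1 | 2012 | 303 | 2 | J2778 | 360 | 1207.53 | 1 | 2025-01-06 | J1100 | ... | | | | | | | | | | |
| 2 | 2 | 2142 | 303 | 2 | G0463 | 610 | 725.12 | 6 | 2025-01-15 | C1713 | ... | | | | | | | | | | |
| 3 | 3 | 2187 | 304 | 1 | J1100 | 430 | 1449.62 | 7 | 2025-01-07 | | ... | | | | | | | | | | |
| 4 | 4 | 2195 | 304 | 4 | J2505 | 510 | 974.27 | 4 | 2025-01-14 | C9399 | ... | A0428 | 636.0 | 1048.32 | 1.0 | 2025-01-11 | | | | | |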


In [42]:
df_claims.to_csv(r"J:/Python/Learning/Data/opps_claims_simulated.csv", index=False)

In [54]:
df_opps_claims = pd.read_csv(r"J:/Python/Learning/Data/opps_claims_simulated.csv")

In [58]:
df_opps_claims.head(5)
df_opps_claims.dtypes

claim_id         int64
bene_id          int64
prov_id          int64
line_ct          int64
line1_hcpcs     object
line1_rev        int64
line1_chrg     float64
line1_units      int64
line1_revdt     object
line2_hcpcs     object
line2_rev      float64
line2_chrg     float64
line2_units    float64
line2_revdt     object
line3_hcpcs     object
line3_rev      float64
line3_chrg     float64
line3_units    float64
line3_revdt     object
line4_hcpcs     object
line4_rev      float64
line4_chrg     float64
line4_units    float64
line4_revdt     object
line5_hcpcs     object
line5_rev      float64
line5_chrg     float64
line5_units    float64
line5_revdt     object
dtype: object
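
The dtypes above show the revenue center columns coming back as `int64` or `float64`, which means the leading zeros were lost in the CSV round trip. One way to keep them, sketched below on an in-memory CSV rather than the real file, is to pass `dtype=str` for those columns to `pd.read_csv`. The sample text and the two-line layout are made up for the example.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the saved claims file (illustrative values).
csv_text = "claim_id,line1_rev,line2_rev\n0,0456,\n1,0360,0250\n"

# Read the revenue center columns as text so codes such as "0456" keep their zero.
rev_cols = [f"line{i}_rev" for i in range(1, 3)]
df = pd.read_csv(io.StringIO(csv_text), dtype={col: str for col in rev_cols})

print(df)
```

For the full file the same idea would apply to `line1_rev` through `line5_rev`. Missing lines still come back as `NaN`.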

In [61]:
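# For the study codes "G0463" and "J0696", summarize the number of claims,
# lines, and providers and the total charge by revenue center code (rvcd).
# rvcd must keep its leading zero, e.g. "0450".

# Reshape the wide claim data to long format: one output row per service line.
# This uses the in-memory df_claims, where revenue codes are still strings;
# the CSV round trip above parsed them as numbers and dropped the leading zeros.
line_rows = []

for _, row in df_claims.iterrows():  # iterate over the claims one row at a time
    n = int(row["line_ct"])
    for i in range(1, n + 1):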
        line_rows.append({
            "claim_id": row["claim_id"],
            "bene_id": row["bene_id"],
            "prov_id": row["prov_id"],
            "line_ct": row["line_ct"],
            "line_no": i,
            "hcpcs": row.get(f"line{i}_hcpcs"),
            "rvcd": row.get(f"line{i}_rev"),
            "chrg": row.get(f"line{i}_chrg"),
            "units": row.get(f"line{i}_units"),
            "revdt": row.get(f"line{i}_revdt")
        })

df_lines = pd.DataFrame(line_rows)
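
The row-by-row loop is easy to follow but slow on large files. A vectorized alternative, sketched here under the assumption that unused line slots are `NaN`, renames the `lineN_field` columns to `field_N` and lets `pd.wide_to_long` do the reshape. The toy frame and the name `lines_long` are made up for the example and cover only two fields.

```python
import re
import numpy as np
import pandas as pd

# Toy wide table with two line slots (illustrative values only).
wide = pd.DataFrame({
    "claim_id": [0, 1],
    "prov_id": [301, 303],
    "line1_hcpcs": ["J3489", "J2778"],
    "line1_chrg": [517.70, 1207.53],
    "line2_hcpcs": [np.nan, "J1100"],
    "line2_chrg": [np.nan, 88.10],
})

# wide_to_long expects names shaped like "<stub>_<number>", so turn
# "line1_hcpcs" into "hcpcs_1"; columns that do not match are left alone.
renamed = wide.rename(columns=lambda c: re.sub(r"^line(\d+)_(\w+)$", r"\2_\1", c))

lines_long = (
    pd.wide_to_long(renamed, stubnames=["hcpcs", "chrg"],
                    i="claim_id", j="line_no", sep="_")
    .dropna(subset=["hcpcs"])  # drop the unused line slots
    .reset_index()
    .sort_values(["claim_id", "line_no"], ignore_index=True)
)
print(lines_long)
```

Claim-level columns such as `prov_id` are carried along automatically. For the full dataset the stub list would also include `rev`, `units`, and `revdt`.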


In [64]:
df_lines.head(10)
df_lines.dtypes

claim_id      int64
bene_id       int64
prov_id       int64
line_ct       int64
line_no       int64
hcpcs        object
rvcd         object
chrg        float64
units       float64
revdt        object
dtype: object

In [66]:
df_lines_G0463_J0696 = df_lines[df_lines["hcpcs"].isin(["G0463", "J0696"])].copy()
df_lines_G0463_J0696["rvcd"] = df_lines_G0463_J0696["rvcd"].astype(str).str.zfill(4)
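
Note that `astype(str).str.zfill(4)` only pads correctly when the codes are already strings or integers. A float such as `456.0` becomes the string `"456.0"`, which is already five characters long and is left unpadded. If the lines were built from the reloaded CSV instead, a normalization step along these lines could be used first. The helper name `normalize_rvcd` is hypothetical.

```python
import pandas as pd

def normalize_rvcd(s):
    """Return 4-character, zero-padded revenue codes; missing values stay missing."""
    # Convert to numbers, then to nullable integers to drop any ".0" suffix.
    num = pd.to_numeric(s, errors="coerce").astype("Int64")
    return num.astype("string").str.zfill(4)

# Mixed input: floats from a CSV read, a missing value, and an intact string code.
raw = pd.Series([456.0, 360.0, None, "0250"])
clean = normalize_rvcd(raw)
print(clean.tolist())
```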

In [67]:
df_lines_G0463_J0696.head(5)

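| | claim_id | bene_id | prov_id | line_ct | line_no | hcpcs | rvcd | chrg | units | revdt |
|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 2 | 2142 | 303 | 2 | 1 | G0463 | 610 | 725.12 | 6.0 | 2025-01-15 |
| 27 | 11 | 2163 | 300 | 5 | 2 | J0696 | 520 | 1110.89 | 4.0 | 2025-01-23 |
| 30 | 11 | 2163 | 300 | 5 | 5 | J0696 | 636 | 597.71 | 1.0 | 2025-01-05 |
| 65 | 24 | 2112 | 301 | 4 | 2 | J0696 | 730 | 1285.66 | 6.0 | 2025-01-08 |
| 70 | 25 | 2084 | 301 | 3 | 3 | G0463 | 910 | 843.8 | 1.0 | 2025-02-01 |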


In [78]:
summary = df_lines_G0463_J0696.groupby(["hcpcs", "rvcd"]).agg(
    claims_count=("claim_id", "nunique"),
    lines_count=("hcpcs", "count"),
    unique_prov_count=("prov_id", "nunique"),
    unique_bene_count=("bene_id", "nunique"),
    total_unit=("units", "sum"),
    total_chrg=("chrg", "sum"),
    min_chrg=("chrg", "min"),
    max_chrg=("chrg", "max"),
    mean_chrg=("chrg", "mean"),
    p10=("chrg", lambda x: x.quantile(0.1)),
    p20=("chrg", lambda x: x.quantile(0.2)),
    p30=("chrg", lambda x: x.quantile(0.3)),
    p40=("chrg", lambda x: x.quantile(0.4)),
    p50=("chrg", lambda x: x.quantile(0.5)),
    p60=("chrg", lambda x: x.quantile(0.6)),
    p70=("chrg", lambda x: x.quantile(0.7)),
    p80=("chrg", lambda x: x.quantile(0.8)),
    p90=("chrg", lambda x: x.quantile(0.9))
).reset_index().sort_values(by=["hcpcs", "total_chrg"], ascending=[True, False])
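
The nine percentile entries follow one pattern, so they can also be generated instead of typed out. The sketch below builds them with a small factory function and a dict comprehension, then passes them to `agg` alongside the other named aggregations. The helper `pct` and the toy data are assumptions made for this example.

```python
import pandas as pd

def pct(q):
    """Build an aggregation function that returns the q-th quantile of a group."""
    def _quantile(x):
        return x.quantile(q)
    _quantile.__name__ = f"p{int(round(q * 100))}"  # give each function a distinct name
    return _quantile

# One named aggregation per decile: p10, p20, ..., p90, all on the charge column.
pct_aggs = {f"p{p}": ("chrg", pct(p / 100)) for p in range(10, 100, 10)}

# Toy line-level data (illustrative values only).
toy = pd.DataFrame({
    "hcpcs": ["G0463"] * 4 + ["J0696"] * 2,
    "rvcd": ["0510"] * 4 + ["0636"] * 2,
    "chrg": [100.0, 200.0, 300.0, 400.0, 50.0, 150.0],
})

summary_toy = (
    toy.groupby(["hcpcs", "rvcd"])
    .agg(total_chrg=("chrg", "sum"), **pct_aggs)
    .reset_index()
)
print(summary_toy[["hcpcs", "rvcd", "total_chrg", "p50"]])
```

Using a factory function rather than a lambda inside the comprehension avoids the late-binding pitfall, where every lambda would end up using the last value of the loop variable.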

In [79]:
summary

| | hcpcs | rvcd | claims_count | lines_count | unique_prov_count | unique_bene_count | total_unit | total_chrg | min_chrg | max_chrg | mean_chrg | p10 | p20 | p30 | p40 | p50 | p60 | p70 | p80 | p90 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14 | G0463 | 510 | 11 | 11 | 4 | 11 | 44.0 | 10000.46 | 137.11 | 1368.37 | 909.132727 | 196.28 | 546.06 | 915.72 | 1002.27 | 1045.17 | 1059.62 | 1184.64 | 1268.01 | 1277.21 |
| 15 | G0463 | 520 | 12 | 12 | 4 | 12 | 78.0 | 9125.97 | 218.23 | 1466.25 | 760.4975 | 224.327 | 310.486 | 647.943 | 694.068 | 740.185 | 851.162 | 1010.681 | 1083.54 | 1134.949 |
| 17 | G0463 | 636 | 10 | 10 | 4 | 10 | 55.0 | 6837.92 | 206.93 | 1359.01 | 683.792 | 234.083 | 308.66 | 484.925 | 691.574 | 786.595 | 793.488 | 821.751 | 879.858 | 954.838 |
| 5 | G0463 | 360 | 6 | 6 | 3 | 6 | 23.0 | 6501.56 | 510.75 | 1483.82 | 1083.593333 | 652.21 | 793.67 | 954.25 | 1114.83 | 1198.43 | 1282.03 | 1299.245 | 1316.46 | 1400.14 |
| 16 | G0463 | 610 | 10 | 10 | 5 | 10 | 63.0 | 5746.91 | 76.62 | 1429.52 | 574.691 | 183.018 | 272.008 | 353.292 | 384.702 | 448.585 | 595.592 | 747.143 | 829.614 | 1001.507 |
| 12 | G0463 | 480 | 7 | 7 | 3 | 7 | 24.0 | 5595.05 | 127.76 | 1317.0 | 799.292857 | 260.054 | 443.134 | 727.786 | 823.922 | 825.8 | 876.662 | 977.056 | 1176.514 | 1272.6 |
| 10 | G0463 | 456 | 6 | 6 | 3 | 6 | 31.0 | 5581.81 | 801.99 | 1079.66 | 930.301667 | 808.93 | 815.87 | 852.035 | 888.2 | 923.965 | 959.73 | 998.045 | 1036.36 | 1058.01 |
| 8 | G0463 | 440 | 6 | 6 | 4 | 6 | 39.0 | 5354.03 | 353.24 | 1483.81 | 892.338333 | 447.05 | 540.86 | 624.045 | 707.23 | 788.545 | 869.86 | 1134.445 | 1399.03 | 1441.42 |
| 6 | G0463 | 420 | 6 | 6 | 3 | 6 | 22.0 | 5182.48 | 443.74 | 1345.71 | 863.746667 | 472.295 | 500.85 | 576.19 | 651.53 | 784.93 | 918.33 | 1120.325 | 1322.32 | 1334.015 |
| 7 | G0463 | 430 | 5 | 5 | 4 | 5 | 36.0 | 5049.29 | 531.36 | 1406.3 | 1009.858 | 714.924 | 898.488 | 997.338 | 1011.474 | 1025.61 | 1053.666 | 1081.722 | 1157.86 | 1282.08 |


In [82]:
# Writing .xlsx requires an Excel engine such as openpyxl to be installed
summary.to_excel(r"J:\Python\Learning\A2_Simulated_OPPS_Claims_Analysis_by_HCPCS_and_Revenue Code\Summary_hcpcs_rvcd_G0463_J0696.xlsx", index=False)