This notebook script generates synthetic medical claims, introduces logical errors (such as a knee surgery billed for a common cold), and trains a model to catch them.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# --- 1. DATA GENERATION (The "Real World" Simulation) ---
np.random.seed(42)
n_claims = 1000

# Generate random claim data
data = pd.DataFrame({
    'claim_id': range(n_claims),
    'age': np.random.randint(18, 90, n_claims),
    'provider_type': np.random.choice(['General', 'Surgeon', 'Cardiologist'], n_claims),
    'service_code': np.random.choice(['Office Visit', 'Knee Surgery', 'Heart Surgery'], n_claims),
    'has_pre_auth': np.random.choice([0, 1], n_claims, p=[0.3, 0.7]), # 30% missing pre-auth
    'diagnosis_match': np.random.choice([0, 1], n_claims, p=[0.2, 0.8]) # 20% mismatch diagnosis
})

# --- 2. LOGIC FOR DENIAL (The "Ground Truth") ---
# We define the rules the AI must "learn" purely from looking at the data.
def determine_denial(row):
    # Rule 1: Surgeries REQUIRE Pre-Auth. If missing, DENY.
    if 'Surgery' in row['service_code'] and row['has_pre_auth'] == 0:
        return 1
    # Rule 2: Diagnosis must match procedure. If mismatch, DENY.
    if row['diagnosis_match'] == 0:
        return 1
    # Rule 3: Random administrative errors (5% chance)
    if np.random.random() < 0.05:
        return 1
    return 0 # Approved

data['is_denied'] = data.apply(determine_denial, axis=1)

# --- 3. PREPROCESSING (Linear Algebra Step) ---
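# Models need numeric inputs, so string categories like 'Surgeon' are mapped to integers (encoding).
# Integer codes imply an order that does not exist; one-hot encoding avoids this and is the usual choice in practice.
# Tree-based models tolerate the simple mapping, so we use it here for readability.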
data_encoded = data.copy()
data_encoded['provider_type'] = data_encoded['provider_type'].map({'General': 0, 'Surgeon': 1, 'Cardiologist': 2})
data_encoded['service_code'] = data_encoded['service_code'].map({'Office Visit': 0, 'Knee Surgery': 1, 'Heart Surgery': 2})

# Features (X) vs Target (y)
X = data_encoded[['age', 'provider_type', 'service_code', 'has_pre_auth', 'diagnosis_match']]
y = data_encoded['is_denied']

# Split into Training (80%) and Testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- 4. MODEL TRAINING (The "Learning" Step) ---
# We use a Random Forest. Each tree picks the splits that most reduce impurity
# (scikit-learn's default criterion is Gini; pass criterion='entropy' for information gain).
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# --- 5. EVALUATION (The Scorecard) ---
predictions = model.predict(X_test)

# --- 6. EXPLAINABILITY (Why?) ---
# Extract which features mattered most
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

# --- OUTPUT RESULTS ---
print("--- MODEL PERFORMANCE ---")
print("Accuracy:", model.score(X_test, y_test))
print("\n--- TOP REASONS FOR DENIAL (Feature Importance) ---")
print(feature_importance)

print("\n--- TEST CASE ---")
# Let's test a specific BAD claim: Knee Surgery, No Pre-Auth
bad_claim = pd.DataFrame([[65, 1, 1, 0, 1]], columns=['age', 'provider_type', 'service_code', 'has_pre_auth', 'diagnosis_match'])
pred = model.predict(bad_claim)[0]
prob = model.predict_proba(bad_claim)[0][1]
print(f"Bad Claim Prediction: {'DENIED' if pred==1 else 'APPROVED'} (Confidence: {prob:.1%})")

--- MODEL PERFORMANCE ---
Accuracy: 0.965

--- TOP REASONS FOR DENIAL (Feature Importance) ---
           feature  importance
4  diagnosis_match    0.443178
3     has_pre_auth    0.303906
2     service_code    0.136677
0              age    0.100940
1    provider_type    0.015299

--- TEST CASE ---
Bad Claim Prediction: DENIED (Confidence: 100.0%)
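
Accuracy alone does not show which kind of mistake the model makes. The cell above imports classification_report and confusion_matrix and computes predictions, but never prints them. A minimal sketch of that step follows. It assumes a small stand-in dataset: the make_claims helper and its deterministic rules are hypothetical, not the notebook's exact generator.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split


def make_claims(n=500, seed=0):
    """Hypothetical stand-in for the notebook's generator (deterministic rules only)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "service_code": rng.integers(0, 3, n),    # 0 = Office Visit, 1 and 2 = surgeries
        "has_pre_auth": rng.integers(0, 2, n),
        "diagnosis_match": rng.integers(0, 2, n),
    })
    surgery_without_auth = (df["service_code"] > 0) & (df["has_pre_auth"] == 0)
    df["is_denied"] = (surgery_without_auth | (df["diagnosis_match"] == 0)).astype(int)
    return df


claims = make_claims()
X = claims.drop(columns="is_denied")
y = claims["is_denied"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
predictions = model.predict(X_test)

# Rows are true classes, columns are predicted classes (0 = approved, 1 = denied).
cm = confusion_matrix(y_test, predictions, labels=[0, 1])
print(cm)
print(classification_report(y_test, predictions, target_names=["approved", "denied"]))
```

The off-diagonal cells of the matrix separate wrongly denied claims (top right) from wrongly approved ones (bottom left), which a single accuracy figure cannot do.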


In [None]:
import joblib

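# 1. Fit the model. This is redundant if the cell above has already run;
#    refitting with the same data and random_state reproduces the same forest.
model.fit(X_train, y_train)

# 2. Save the fitted model to a file
joblib.dump(model, 'denial_model.pkl')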

print("Model saved as denial_model.pkl")
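
Saving is only half of the round trip. The sketch below reloads a model with joblib.load and checks that it predicts the same as the original. To stay self-contained it trains a tiny stand-in model on made-up data and writes to a temporary directory; in the notebook you would load 'denial_model.pkl' directly.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny stand-in model; in the notebook this would be the fitted `model`.
# Columns (illustrative): [is_surgery, has_pre_auth]
X = np.array([[0, 1], [1, 0], [1, 1], [0, 0]] * 10)
y = np.array([0, 1, 0, 0] * 10)   # denied only for surgery without pre-auth
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "denial_model.pkl")
    joblib.dump(model, path)       # write the fitted model to disk
    restored = joblib.load(path)   # read it back into a new object

same = np.array_equal(model.predict(X), restored.predict(X))
print("Restored model matches original:", same)
```

Because joblib files are pickles, load them only from sources you trust, and ideally with the same scikit-learn version that wrote them.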

Understanding the Output

When you run this code, you will typically see results like this:
1. Feature Importance (The "Why")

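The model outputs a table showing which columns drove the decision. In the run above, diagnosis_match (0.44) and has_pre_auth (0.30) dominate, service_code follows (0.14), and provider_type is negligible (0.02). age scores about 0.10 even though no rule uses it: impurity-based importance tends to inflate numeric features with many distinct values, which the trees can split on repeatedly while fitting the 5% random denials.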

    Math Insight: At each node the algorithm computes the Gini impurity of the candidate splits. Splitting on has_pre_auth produces much purer child nodes than splitting on age, so it is chosen more often and earns a higher importance.
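
To make the Gini comparison concrete, here is a small worked sketch. The eight claims and the is_senior column (a stand-in for an age threshold) are made up for illustration, and the gini and split_impurity helpers are hypothetical names, not scikit-learn functions.

```python
import numpy as np


def gini(labels):
    """Gini impurity of a set of 0/1 labels: 1 - sum(p_k ** 2)."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    p = np.bincount(labels, minlength=2) / labels.size
    return 1.0 - np.sum(p ** 2)


def split_impurity(feature, labels):
    """Weighted Gini impurity of the two children after splitting on a binary feature."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    left, right = labels[feature == 0], labels[feature == 1]
    n = labels.size
    return left.size / n * gini(left) + right.size / n * gini(right)


# Eight hypothetical surgery claims: denied (1) exactly when pre-auth is missing.
has_pre_auth = np.array([0, 0, 0, 1, 1, 1, 1, 1])
is_senior = np.array([0, 1, 0, 1, 0, 1, 0, 1])   # stand-in for an age threshold
is_denied = np.array([1, 1, 1, 0, 0, 0, 0, 0])

print("parent:", gini(is_denied))                                        # 0.46875
print("split on has_pre_auth:", split_impurity(has_pre_auth, is_denied))  # 0.0
print("split on is_senior:", split_impurity(is_senior, is_denied))        # 0.4375
```

The split on has_pre_auth drives impurity from 0.46875 to 0, while the age-like split barely moves it, which is why the tree prefers the former.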

2. The Test Case

We fed it a Bad Claim: Knee Surgery (1), No Pre-Auth (0).

    Prediction: DENIED

    Confidence: 100.0% in the run above

    Why? The model successfully "learned" Rule #1 from our data generation step without us explicitly programming the if statement into the model itself.

How this maps to the Math Pipeline

    Vectorization: data_encoded turned "Knee Surgery" into [1]. The model sees a vector [65, 1, 1, 0, 1].

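    Calculus (Optimization): A Random Forest has no gradients to follow. During model.fit, each tree instead runs a greedy search over candidate splits and keeps the one that most reduces impurity in the child nodes (Gini by default in scikit-learn; entropy is optional).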

    Probability: predict_proba averages the class probabilities predicted by the 100 trees in the forest. When every tree lands in a pure leaf, this equals the share of trees voting for "Denial": if 99 trees say "Deny," the probability is 0.99.
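
The probability step can be checked by hand. The sketch below averages each tree's predicted probabilities and compares the result with the forest's own predict_proba. It assumes a small made-up dataset with two features and 5% label noise, not the notebook's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical features: [is_surgery, has_pre_auth]
X = rng.integers(0, 2, size=(200, 2))
rule = (X[:, 0] == 1) & (X[:, 1] == 0)   # surgery without pre-auth
noise = rng.random(200) < 0.05           # 5% random denials
y = (rule | noise).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

claim = np.array([[1, 0]])               # surgery, no pre-auth

# Ask every tree for its own class probabilities, then average them.
per_tree = np.array([tree.predict_proba(claim)[0] for tree in model.estimators_])
manual = per_tree.mean(axis=0)

forest = model.predict_proba(claim)[0]
print("manual average:", manual)
print("forest output: ", forest)
```

The two arrays agree, and the second entry of each is the denial probability reported as "Confidence" in the test case.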