
# 🧠 Provider Behavior EDA Plan

This notebook explores provider behavior using metrics such as hit rate, recovery rate, dispute patterns, and medical record receipt compliance.

---

## 🔍 High-Level Goals of EDA

1. **Segment** providers by behavior (e.g., high/low performing, compliant/risky).
2. **Understand** how provider behavior varies by volume (bucket).
3. **Identify** correlations between audit outcomes and disputes, recovery, and medical record receipt (MRR).
4. **Find** outliers and high-impact providers.
5. **Prepare** findings that inform strategy (e.g., prioritizing audits, flagging risky providers).

---

## ✅ Columns in the Dataset

- `providertaxid`
- `findings`, `no_findings`, `total_audits`
- `dispute_ratio`, `overturned_ratio`
- `mr_receiving_rate`
- `cancelled_count`
- `hit_rate`, `recovery_rate`
- `total_overpay`, `total_recovery`
- `volume_bucket` (0–5, 5–15, 15–50, 50–100, 100+)

---

## 📊 EDA Steps and Visualizations

### 1. Summary Stats per Bucket
- Mean, median, std by `volume_bucket` for key metrics
- Bar, box, violin plots

### 2. Distribution & Correlation
- Histograms, correlation heatmap
- Pairwise scatterplots

### 3. Volume vs Performance
- Focus on providers with more than 15 audits (`total_audits` > 15)
- Scatterplots of audits vs metrics
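
A minimal sketch of this step, assuming `total_audits` holds the audit count per provider. The toy frame and the choice of Spearman rank correlation are illustrative assumptions, not values or methods taken from the real dataset:

```python
import pandas as pd

# Toy stand-in for the provider table (all values are made up).
toy = pd.DataFrame({
    "total_audits": [3, 12, 16, 20, 40, 80, 150],
    "hit_rate": [1.00, 0.10, 0.30, 0.35, 0.40, 0.55, 0.60],
    "recovery_rate": [0.00, 0.90, 0.80, 0.70, 0.65, 0.50, 0.45],
})

# Keep only providers with more than 15 audits, where rates are less noisy.
high_volume = toy[toy["total_audits"] > 15]

# Rank correlation between audit volume and each metric.
volume_corr = (
    high_volume[["total_audits", "hit_rate", "recovery_rate"]]
    .corr(method="spearman")["total_audits"]
    .drop("total_audits")
)
print(volume_corr)
```

A rank correlation is used here because audit counts tend to be heavily right-skewed; the same filtered frame can be passed to `sns.scatterplot` for the visual check.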

### 4. Risk Profiling & Outliers
- Z-score or IQR method to flag outliers
- Scatter plots for risky behavior
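
The Z-score option could look roughly like the sketch below. The helper name `flag_outliers_zscore`, the threshold of 3, and the toy series are assumptions for illustration:

```python
import pandas as pd

def flag_outliers_zscore(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return True where a value is more than `threshold` std devs from the mean."""
    std = series.std(ddof=0)
    # A constant (or empty) series has no spread, so nothing can be an outlier.
    if pd.isna(std) or std == 0:
        return pd.Series(False, index=series.index)
    z = (series - series.mean()) / std
    return z.abs() > threshold

# Toy example: 19 typical dispute ratios and one extreme value (made-up data).
dispute_ratio = pd.Series([0.10] * 19 + [0.95])
flags = flag_outliers_zscore(dispute_ratio)
print(flags.sum())
```

Z-scores rely on the mean and standard deviation, which the outliers themselves inflate, so the IQR method is usually the safer choice for skewed ratios.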

### 5. (Optional) Regression & Clustering
- Linear regression
- KMeans clustering
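
One possible shape for this optional step, assuming scikit-learn is available (it is not imported elsewhere in this notebook). The toy data, the feature choice, and `n_clusters=2` are illustrative assumptions:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy data: two made-up provider groups.
toy = pd.DataFrame({
    "hit_rate":      [0.10, 0.15, 0.20, 0.70, 0.75, 0.80],
    "dispute_ratio": [0.05, 0.10, 0.08, 0.60, 0.55, 0.65],
    "recovery_rate": [0.90, 0.85, 0.80, 0.30, 0.25, 0.20],
})

# Linear regression: how recovery_rate changes with hit_rate.
reg = LinearRegression().fit(toy[["hit_rate"]], toy["recovery_rate"])
slope = reg.coef_[0]
print(f"slope={slope:.2f}, intercept={reg.intercept_:.2f}")

# KMeans on standardized features so no single metric dominates the distance.
features = StandardScaler().fit_transform(toy[["hit_rate", "dispute_ratio"]])
toy["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(toy[["hit_rate", "dispute_ratio", "cluster"]])
```

Cluster labels are arbitrary integers, so interpret each cluster by its mean metrics rather than by its label.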

### 6. Treating Low-Volume Providers
- Treat as a low-confidence group: rates based on only a few audits are noisy
- Report summary stats only and avoid drawing hard conclusions
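
To make the low-confidence point concrete, the sketch below attaches a Wilson score interval to each hit rate. It assumes `hit_rate` equals `findings / total_audits`, which this notebook does not state explicitly, and that `total_audits` is greater than zero; `wilson_interval` is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    successes = np.asarray(successes, dtype=float)
    n = np.asarray(n, dtype=float)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Two made-up providers with the same observed hit rate (100%) but different volume.
toy = pd.DataFrame({"findings": [2, 100], "total_audits": [2, 100]})
toy["hit_rate_low"], toy["hit_rate_high"] = wilson_interval(
    toy["findings"], toy["total_audits"]
)
toy["ci_width"] = toy["hit_rate_high"] - toy["hit_rate_low"]
print(toy)
```

Both providers show the same point estimate, but the interval for the 2-audit provider is far wider, which is the reason to avoid hard conclusions for this group.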

---

## 📤 Final Deliverables

- Summary dashboard and plots
- High-risk provider list
- Insight bullets


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from scipy import stats

# Load dataset
df = pd.read_csv("your_provider_data.csv")  # Replace with actual file path


In [None]:
# Summary statistics by volume_bucket
bucket_summary = df.groupby('volume_bucket').agg({
    'hit_rate': ['mean', 'median', 'std'],
    'recovery_rate': ['mean', 'median', 'std'],
    'dispute_ratio': ['mean', 'median'],
    'mr_receiving_rate': ['mean', 'median'],
    'total_overpay': 'sum',
    'total_recovery': 'sum',
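    'providertaxid': 'count'  # non-null provider IDs per bucket
}).reset_index()
# Flatten the two-level column index, e.g. ('hit_rate', 'mean') -> 'hit_rate_mean'
bucket_summary.columns = ['_'.join(col).strip('_') for col in bucket_summary.columns.values]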
bucket_summary

In [None]:
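# Boxplot of hit_rate by bucket
# Note: if volume_bucket is stored as strings, pass order=[...] with the actual
# bucket labels so the x-axis runs from lowest to highest volume.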
plt.figure(figsize=(10, 6))
sns.boxplot(x='volume_bucket', y='hit_rate', data=df)
plt.title("Hit Rate by Volume Bucket")
plt.show()


In [None]:
# Correlation heatmap
metrics = [
    'hit_rate', 'recovery_rate', 'dispute_ratio', 'overturned_ratio',
    'mr_receiving_rate', 'total_overpay', 'total_recovery',
]
plt.figure(figsize=(10, 8))
# Pin the color scale to [-1, 1] so zero correlation maps to the neutral midpoint
sns.heatmap(df[metrics].corr(), annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
plt.title("Correlation Matrix")
plt.show()


In [None]:
# Flag outliers using IQR
def flag_outliers_iqr(series):
    """Return a boolean mask marking values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)

df['outlier_hit_rate'] = flag_outliers_iqr(df['hit_rate'])
df['outlier_recovery_rate'] = flag_outliers_iqr(df['recovery_rate'])
df['outlier_dispute'] = flag_outliers_iqr(df['dispute_ratio'])

# View flagged outliers
df[df[['outlier_hit_rate', 'outlier_recovery_rate', 'outlier_dispute']].any(axis=1)].head()


In [None]:
# Scatter plot: hit_rate vs recovery_rate
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='hit_rate', y='recovery_rate', hue='volume_bucket')
plt.title("Recovery Rate vs Hit Rate by Volume Bucket")
plt.show()
