
# 🧠 Provider Behavior EDA Plan

This notebook explores provider behavior using metrics such as hit rate, recovery rate, dispute patterns, and medical record receipt compliance.

---

## 🔍 High-Level Goals of EDA

1. **Segment** providers by behavior (e.g., high/low performing, compliant/risky).
2. **Understand** how provider behavior varies by volume (bucket).
3. **Identify** correlations between audit outcomes and disputes, recovery, and medical record receipt (MRR).
4. **Find** outliers and high-impact providers.
5. **Prepare** to inform strategy (e.g., prioritize audits, flag risky providers).

---

## ✅ Columns in the Dataset

- `providertaxid`
- `findings`, `no_findings`, `total_audits`
- `dispute_ratio`, `overturned_ratio`
- `mr_receiving_rate`
- `cancelled_count`
- `hit_rate`, `recovery_rate`
- `total_overpay`, `total_recovery`
- `volume_bucket` (`0-5`, `5-15`, `15-50`, `50-100`, `100+`; labels use plain hyphens, as in the code below)

---

## 📊 EDA Steps and Visualizations

### 1. Summary Stats per Bucket
- Mean, median, std by `volume_bucket` for key metrics
- Bar, box, violin plots

### 2. Distribution & Correlation
- Histograms, correlation heatmap
- Pairwise scatterplots

### 3. Volume vs Performance
- Focus on providers with >15 audits
- Scatterplots of audits vs metrics

### 4. Risk Profiling & Outliers
- Z-score or IQR method to flag outliers
- Scatter plots for risky behavior
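
The code cells below only implement the IQR method, so here is a minimal sketch of the Z-score alternative mentioned in this step. The function name `flag_outliers_zscore`, the threshold of 3, and the toy `hit_rate` values are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

def flag_outliers_zscore(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Flag values whose absolute z-score exceeds `threshold` (assumed default: 3)."""
    std = series.std(ddof=0)  # population standard deviation
    if pd.isna(std) or std == 0:
        # No spread (or no data): nothing can be flagged
        return pd.Series(False, index=series.index)
    z = (series - series.mean()) / std
    # NaN z-scores compare as False, so missing values are never flagged
    return z.abs() > threshold

# Hypothetical toy data: eleven typical providers and one extreme one
toy = pd.DataFrame({"hit_rate": [0.50, 0.52, 0.48, 0.51, 0.49, 0.50,
                                 0.53, 0.47, 0.50, 0.51, 0.49, 0.99]})
toy["outlier_hit_rate_z"] = flag_outliers_zscore(toy["hit_rate"])
print(toy[toy["outlier_hit_rate_z"]])
```

Z-scores assume a roughly symmetric distribution and are themselves pulled by extreme values, so the IQR method is likely the safer default for skewed rate metrics.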

### 5. (Optional) Regression & Clustering
- Linear regression
- KMeans clustering

### 6. Treating Low Volume Providers
- Low-confidence group
- Use summary stats, avoid drawing hard conclusions
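
One possible way to make "low confidence" concrete is to look at a lower confidence bound on the hit rate instead of the raw ratio. The sketch below uses the Wilson score interval, which is a technique swapped in here for illustration and not something the plan prescribes. The function name `wilson_lower_bound` and the example counts are hypothetical.

```python
import math

def wilson_lower_bound(findings: int, total_audits: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for findings / total_audits.

    z = 1.96 corresponds to an approximate 95% interval (an assumed default).
    """
    if total_audits <= 0:
        return 0.0  # no audits: no evidence either way
    p = findings / total_audits
    denom = 1 + z**2 / total_audits
    centre = p + z**2 / (2 * total_audits)
    margin = z * math.sqrt(p * (1 - p) / total_audits
                           + z**2 / (4 * total_audits**2))
    return (centre - margin) / denom

# Hypothetical providers, given as (findings, total_audits)
small = wilson_lower_bound(3, 3)      # 100% raw hit rate on only 3 audits
large = wilson_lower_bound(90, 100)   # 90% raw hit rate on 100 audits
print(f"small provider: {small:.2f}, large provider: {large:.2f}")
```

The three-audit provider has the higher raw hit rate but the lower bound, which is the behavior wanted when ranking providers without over-trusting small samples. In the notebook this could be applied row by row to `findings` and `total_audits`.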

---

## 📤 Final Deliverables

- Summary dashboard and plots
- High-risk provider list
- Insight bullets


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from scipy import stats

# Load dataset
df = pd.read_csv("your_provider_data.csv")  # Replace with actual file path


In [None]:
# Summary statistics by volume_bucket
bucket_summary = df.groupby('volume_bucket').agg({
    'hit_rate': ['mean', 'median', 'std'],
    'recovery_rate': ['mean', 'median', 'std'],
    'dispute_ratio': ['mean', 'median'],
    'mr_receiving_rate': ['mean', 'median'],
    'total_overpay': 'sum',
    'total_recovery': 'sum',
    'providertaxid': 'count'
}).reset_index()
bucket_summary.columns = ['_'.join(col).strip('_') for col in bucket_summary.columns.values]
bucket_summary

In [None]:
# Boxplot of hit_rate by bucket
plt.figure(figsize=(10, 6))
sns.boxplot(x='volume_bucket', y='hit_rate', data=df)
plt.title("Hit Rate by Volume Bucket")
plt.show()


In [None]:
# Correlation heatmap
metrics = ['hit_rate', 'recovery_rate', 'dispute_ratio', 'overturned_ratio', 'mr_receiving_rate', 'total_overpay', 'total_recovery']
plt.figure(figsize=(10, 8))
sns.heatmap(df[metrics].corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()


In [None]:
# Flag outliers using IQR
def flag_outliers_iqr(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    return ((series < (Q1 - 1.5 * IQR)) | (series > (Q3 + 1.5 * IQR)))

df['outlier_hit_rate'] = flag_outliers_iqr(df['hit_rate'])
df['outlier_recovery_rate'] = flag_outliers_iqr(df['recovery_rate'])
# Named after the metric so it matches the flag column used in the later cell
df['outlier_dispute_ratio'] = flag_outliers_iqr(df['dispute_ratio'])

# View providers flagged on at least one metric
df[df[['outlier_hit_rate', 'outlier_recovery_rate', 'outlier_dispute_ratio']].any(axis=1)].head()


In [None]:
# Scatter plot: hit_rate vs recovery_rate
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='hit_rate', y='recovery_rate', hue='volume_bucket')
plt.title("Recovery Rate vs Hit Rate by Volume Bucket")
plt.show()


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings('ignore')

# Load dataset
# df = pd.read_csv('your_data.csv')

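# Derived columns (these overwrite total_audits and hit_rate if already present in the file)
df['total_audits'] = df['findings'] + df['no_findings']
# Providers with zero audits get NaN here (0 / 0)
df['hit_rate'] = df['findings'] / df['total_audits']
# Labels must match the data exactly (plain hyphens); unmatched labels become NaN
df['volume_bucket'] = pd.Categorical(df['volume_bucket'],
    categories=["0-5", "5-15", "15-50", "50-100", "100+"], ordered=True)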

# Summary by volume bucket
bucket_summary = df.groupby('volume_bucket').agg({
    'hit_rate': ['mean', 'median', 'std'],
    'dispute_ratio': ['mean', 'median', 'std'],
    'overturned_ratio': ['mean', 'median', 'std'],
    'recovery_rate': ['mean', 'median', 'std'],
    'mr_receiving_rate': ['mean', 'median', 'std'],
    'total_overpay': 'sum',
    'total_recovery': 'sum',
    'providertaxid': 'count',
    'cancelled_count': 'sum'
}).reset_index()

# Distribution plots
for col in ['hit_rate', 'dispute_ratio', 'recovery_rate', 'mr_receiving_rate']:
    sns.histplot(df[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.show()

# Correlation heatmap
corr = df[['hit_rate', 'recovery_rate', 'dispute_ratio', 'overturned_ratio', 'mr_receiving_rate', 'total_overpay', 'total_recovery']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()

# Volume vs performance
high_vol_df = df[df['volume_bucket'].isin(["15-50", "50-100", "100+"])]
sns.scatterplot(data=high_vol_df, x='total_audits', y='hit_rate')
plt.title("Total Audits vs Hit Rate")
plt.show()

sns.scatterplot(data=high_vol_df, x='total_overpay', y='recovery_rate')
plt.title("Total Overpay vs Recovery Rate")
plt.show()

# Outlier detection
def find_outliers_iqr(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return (series < lower) | (series > upper)

df['outlier_hit_rate'] = find_outliers_iqr(df['hit_rate'])
df['outlier_recovery_rate'] = find_outliers_iqr(df['recovery_rate'])
df['outlier_dispute_ratio'] = find_outliers_iqr(df['dispute_ratio'])

# Rule-based risk flags
df['risk_high_hit_low_recovery'] = (df['hit_rate'] > 0.8) & (df['recovery_rate'] < 0.3)
df['risk_high_dispute_high_overturn'] = (df['dispute_ratio'] > 0.5) & (df['overturned_ratio'] > 0.5)
df['risk_low_mrr'] = df['mr_receiving_rate'] < 0.5

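# One label per provider. Later assignments overwrite earlier ones, so a provider
# matching several rules keeps only the last matching label (LowMRR wins).
df['risk_label'] = 'Normal'
df.loc[df['risk_high_hit_low_recovery'], 'risk_label'] = 'HighHit_LowRecovery'
df.loc[df['risk_high_dispute_high_overturn'], 'risk_label'] = 'HighDispute_HighOverturn'
df.loc[df['risk_low_mrr'], 'risk_label'] = 'LowMRR'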

# Top outliers table
outlier_providers = df[df[['outlier_hit_rate', 'outlier_recovery_rate', 'outlier_dispute_ratio']].any(axis=1)]
top_outliers = outlier_providers.sort_values('total_overpay', ascending=False).head(10)
print(top_outliers[['providertaxid', 'hit_rate', 'recovery_rate', 'dispute_ratio', 'risk_label']])

# Scatter plot: hit_rate vs recovery_rate
fig = px.scatter(df, x='hit_rate', y='recovery_rate', color='volume_bucket',
                 hover_data=['providertaxid'], title="Hit Rate vs Recovery Rate")
fig.show()

# Regression plot
sns.lmplot(data=df, x='hit_rate', y='recovery_rate', line_kws={"color": "red"})
plt.title("Regression: Hit Rate vs Recovery Rate")
plt.show()

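# Clustering (optional)
cluster_cols = ['hit_rate', 'recovery_rate', 'mr_receiving_rate', 'dispute_ratio']
features = df[cluster_cols].dropna()  # KMeans cannot handle missing values
scaled = StandardScaler().fit_transform(features)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)

# Align labels on the index: assigning the raw array to df fails with a
# length mismatch whenever dropna() removed rows. Dropped rows stay NaN.
df['cluster'] = pd.Series(kmeans.fit_predict(scaled), index=features.index)

# Plot only the providers that were actually clustered
clustered = df.dropna(subset=['cluster']).astype({'cluster': int})
sns.scatterplot(data=clustered, x='hit_rate', y='recovery_rate', hue='cluster', palette='Set2')
plt.title("KMeans Clustering of Providers")
plt.show()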


## 📊 Explanation of Charts Used in the Provider EDA

| Chart Type | Used For | Best For Interpreting | Your Use Case |
|---|---|---|---|
| 🔲 Boxplot | Distribution, outliers, comparison across categories | Spread, median, and presence of extreme values across volume buckets | Compare metrics like `hit_rate`, `recovery_rate`, `dispute_ratio` across `volume_bucket`; check if low-volume providers behave differently |
| 📈 Violin Plot (optional alt to boxplot) | Shape + distribution | Skew, multi-modal data | Shows more distributional nuance than boxplots |
| 📊 Bar Plot | Count or average comparison | Total providers, avg rates by group | Count of providers per `volume_bucket`, or average `dispute_ratio` per bucket |
| 📉 Histogram | Frequency distribution of continuous variables | Understand data distribution shape | Check if `hit_rate`, `recovery_rate`, `dispute_ratio` are normally distributed or skewed; helps in detecting unusual patterns |
| 🔥 Correlation Heatmap | Relationships between numeric variables | Strength & direction of linear correlation | Understand if `hit_rate` relates to `recovery_rate`, `dispute_ratio`, or MRR; helps in prioritizing metrics |
| 🧮 Scatter Plot (Matplotlib/Seaborn) | Bi-variate relationships | Patterns or clusters between two metrics | Key for analyzing `hit_rate` vs `recovery_rate`, or `total_audits` vs `hit_rate` |
| 🔍 Plotly Scatter (Interactive) | Zoomable, filterable visualization | Dynamic deep-dive | Lets you see relationships + outliers interactively, colored by `volume_bucket` |
| 📐 Regression Plot (`sns.lmplot`) | Fit a trendline | Strength + direction of impact | E.g., does `hit_rate` drive `recovery_rate` up? Informs recovery strategy |
| 📊 Table of Top Outliers | Summary of flagged providers | Direct reporting or escalation | Helps identify providers with strange or risky behavior |
| 🔵 Clustered Scatter Plot | Cluster analysis | Behavioral segmentation | Visualizes groups like compliant vs risky providers based on clustering (KMeans) |

## ✅ Which Charts Are Most Useful for You Now?

### 1. 📊 Bar Plot: Count of Providers per Volume Bucket
- Shows how your providers are spread across buckets (e.g., 2100+ in the 0-5 audit bucket).
- Use this to justify separating analysis by audit volume.

### 2. 🔲 Boxplot / Violin: Key Metrics by Volume Bucket
- Compare `hit_rate`, `recovery_rate`, `dispute_ratio`, etc. across buckets.
- Use this to detect whether high-volume providers behave differently.

### 3. 📉 Histogram
- One per key metric: `hit_rate`, `recovery_rate`, `dispute_ratio`, `mr_receiving_rate`.
- Use this to detect skewness, outliers, or irregular patterns.

### 4. 🔥 Correlation Heatmap
- Shows which metrics move together.
- Use this to build feature understanding for further modeling or clustering.

### 5. 📈 Scatter Plots (`hit_rate` vs `recovery_rate`, `total_audits` vs `hit_rate`)
- Visualize whether there is a trend or a cluster of poor performers.
- Use this to isolate providers who show a high hit rate but poor recovery (potential fraud).

### 6. 📊 Table of Outliers / Risk Flags
- Quickly see which providers are behaving abnormally.
- Use this to prepare for reporting, escalation, or re-audit.

### 7. 📉 Regression Plot (`lmplot`)
- Fits a line to show the relationship between `hit_rate` and `recovery_rate`.
- Use this to gauge how strongly one metric is associated with another; a fitted line shows association, not causation.

### 8. 🔵 Clustering Plot
- Behavioral segmentation using clustering.
- Use this to group providers into compliant / risky / abnormal groups.
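
The notebook cells do not yet compute the provider count per bucket described in the first item of this list, so here is a minimal sketch. The toy rows and the names `bucket_order`, `provider_counts`, and `provider_share` are assumptions for illustration. In the notebook the same calls would run on the loaded `df`.

```python
import pandas as pd

# Hypothetical toy data standing in for the real provider table
toy = pd.DataFrame({
    "providertaxid": ["A", "B", "C", "D", "E", "F"],
    "volume_bucket": ["0-5", "0-5", "0-5", "5-15", "15-50", "100+"],
})
bucket_order = ["0-5", "5-15", "15-50", "50-100", "100+"]

# Distinct providers per bucket, in bucket order, with empty buckets kept as 0
provider_counts = (
    toy.groupby("volume_bucket")["providertaxid"]
       .nunique()
       .reindex(bucket_order, fill_value=0)
)
provider_share = provider_counts / provider_counts.sum()

print(provider_counts)
print(provider_share.round(2))
# provider_counts.plot(kind="bar") would draw the bar chart (needs matplotlib)
```

Reporting the share alongside the count makes it easier to see how much of the provider base sits in the low-volume bucket before comparing rates across buckets.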

## 🧠 How to Interpret These in Your Use Case

| Metric | If High | If Low | Insight |
|---|---|---|---|
| `hit_rate` | Likely accurate auditing or fraud-prone | Audits not yielding issues | High + low recovery = possible fraud |
| `recovery_rate` | Good collections | Poor collection or disputes | High `hit_rate` + low `recovery_rate` = leakage |
| `dispute_ratio` | Aggressive providers | Compliant | Combine with `overturned_ratio` to see if justified |
| `overturned_ratio` | Providers win disputes | Audits mostly correct | High value = review audit accuracy |
| `mr_receiving_rate` | Good cooperation | Compliance risk | Low MRR = audit resistance or delay tactics |