# Healthcare Cost Prediction Dashboard
## Interactive Analysis & Visualization Notebook

---

**Project Overview:**  
This notebook is an interactive front end for the Healthcare Cost Prediction project. It imports the `healthcare_analysis.py` module to run the prediction pipeline, then explores the data, the model's performance, and its predictions.

**Key Features:**
- Imports reusable functions from `healthcare_analysis.py`
- Interactive exploratory data analysis
- Custom visualizations and insights
- Quick prediction interface
- Extended analysis capabilities

---

## 1. Setup & Imports

We'll import all necessary libraries and functions from our main analysis module.

In [None]:
# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Import functions from our healthcare_analysis module
from healthcare_analysis import (
    load_and_clean_data,
    load_validation_data,
    prepare_features,
    train_model,
    predict_validation,
    create_dashboard
)

# Configure plotting
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)

print("✓ All imports successful!")
print("✓ Ready to use functions from healthcare_analysis.py")

## 2. Quick Start - Full Analysis Pipeline

Run the complete analysis pipeline using the main module.

In [None]:
# Run the entire pipeline from the module
from healthcare_analysis import main

# Execute full analysis
df, val_df, predictions, metrics = main()

print("\n" + "="*60)
print("✓ Full analysis pipeline completed!")
print("✓ Dashboard saved as 'healthcare_dashboard.png'")
print("✓ Predictions saved as 'validation_predictions.csv'")
print("="*60)
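
The messages above are printed whether or not `main()` actually wrote the files. A minimal sketch of a check follows. It assumes the two file names quoted in the messages and that they are written to the current working directory. `missing_outputs` is a hypothetical helper, not part of `healthcare_analysis.py`.

```python
from pathlib import Path


def missing_outputs(paths, base_dir="."):
    """Return the subset of `paths` that does not exist under `base_dir`."""
    base = Path(base_dir)
    return [p for p in paths if not (base / p).is_file()]


# File names taken from the messages printed by the cell above;
# the working-directory location is an assumption.
expected = ["healthcare_dashboard.png", "validation_predictions.csv"]
missing = missing_outputs(expected)
if missing:
    print("Missing outputs:", ", ".join(missing))
else:
    print("All expected output files are present.")
```

Running this right after the pipeline cell makes a failed save visible before later cells try to read `validation_predictions.csv`.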

## 3. Interactive Data Exploration

Now let's explore the data interactively, using the `df` and `val_df` datasets returned by `main()` in Section 2.

In [None]:
# Display basic information
print("="*60)
print("DATASET OVERVIEW")
print("="*60)
print(f"\nTraining Data Shape: {df.shape}")
print(f"Validation Data Shape: {val_df.shape}")
print("\nTraining Data Preview (first 10 rows):")
display(df.head(10))

print("\n" + "="*60)
print("STATISTICAL SUMMARY")
print("="*60)
display(df.describe())

### 3.1 Distribution Analysis

In [None]:
# Create custom distribution plots
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Key Feature Distributions', fontsize=16, fontweight='bold')

# Age distribution
axes[0, 0].hist(df['age'], bins=30, color='#2E86AB', edgecolor='white', alpha=0.7)
axes[0, 0].axvline(df['age'].mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {df["age"].mean():.1f}')
axes[0, 0].set_xlabel('Age')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Age Distribution')
axes[0, 0].legend()

# BMI distribution
axes[0, 1].hist(df['bmi'], bins=30, color='#A23B72', edgecolor='white', alpha=0.7)
axes[0, 1].axvline(df['bmi'].mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {df["bmi"].mean():.1f}')
axes[0, 1].set_xlabel('BMI')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('BMI Distribution')
axes[0, 1].legend()

# Charges distribution
axes[1, 0].hist(df['charges'], bins=40, color='#F18F01', edgecolor='white', alpha=0.7)
axes[1, 0].axvline(df['charges'].mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: ${df["charges"].mean():,.0f}')
axes[1, 0].set_xlabel('Charges ($)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Charges Distribution')
axes[1, 0].legend()

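# Smoker distribution (colors keyed by label so they do not depend on count order)
smoker_counts = df['smoker'].value_counts()
smoker_palette = {'yes': '#e74c3c', 'no': '#27ae60'}
axes[1, 1].bar(smoker_counts.index, smoker_counts.values,
               color=[smoker_palette.get(s, '#7f8c8d') for s in smoker_counts.index], edgecolor='white')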
axes[1, 1].set_xlabel('Smoker Status')
axes[1, 1].set_ylabel('Count')
axes[1, 1].set_title('Smoker Distribution')

plt.tight_layout()
plt.show()

### 3.2 Categorical Analysis

In [None]:
# Analyze charges by different categories
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Healthcare Charges by Category', fontsize=16, fontweight='bold')

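# 1. By Smoker Status (colors keyed by label, not by sort position)
smoker_charges = df.groupby('smoker')['charges'].mean().sort_values(ascending=False)
smoker_palette = {'yes': '#e74c3c', 'no': '#27ae60'}
bars = axes[0, 0].bar(smoker_charges.index, smoker_charges.values,
                      color=[smoker_palette.get(s, '#7f8c8d') for s in smoker_charges.index],
                      edgecolor='white', linewidth=2)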
axes[0, 0].set_ylabel('Average Charges ($)', fontsize=11)
axes[0, 0].set_title('Impact of Smoking on Healthcare Costs', fontsize=12, fontweight='bold')
for bar, val in zip(bars, smoker_charges.values):
    axes[0, 0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 500,
                    f'${val:,.0f}', ha='center', va='bottom', fontsize=11, fontweight='bold')

# 2. By Age Group
df_temp = df.copy()
df_temp['age_group'] = pd.cut(df_temp['age'], bins=[0, 25, 35, 45, 55, 65, 100],
                               labels=['18-25', '26-35', '36-45', '46-55', '56-65', '65+'])
# observed=True skips age bins with no rows, which would otherwise plot as NaN bars
age_charges = df_temp.groupby('age_group', observed=True)['charges'].mean()
bars = axes[0, 1].bar(age_charges.index, age_charges.values, color='#2E86AB', edgecolor='white', linewidth=2)
axes[0, 1].set_ylabel('Average Charges ($)', fontsize=11)
axes[0, 1].set_title('Healthcare Costs by Age Group', fontsize=12, fontweight='bold')
axes[0, 1].tick_params(axis='x', rotation=45)
for bar, val in zip(bars, age_charges.values):
    axes[0, 1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 200,
                    f'${val:,.0f}', ha='center', va='bottom', fontsize=9)

# 3. By Region
region_charges = df.groupby('region')['charges'].mean().sort_values(ascending=False)
region_colors = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12']
bars = axes[1, 0].bar(region_charges.index, region_charges.values, color=region_colors[:len(region_charges)], edgecolor='white', linewidth=2)
axes[1, 0].set_ylabel('Average Charges ($)', fontsize=11)
axes[1, 0].set_title('Regional Healthcare Cost Variations', fontsize=12, fontweight='bold')
for bar, val in zip(bars, region_charges.values):
    axes[1, 0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 200,
                    f'${val:,.0f}', ha='center', va='bottom', fontsize=9)

# 4. By Sex
sex_charges = df.groupby('sex')['charges'].mean()
bars = axes[1, 1].bar(sex_charges.index, sex_charges.values, color=['#e91e63', '#2196f3'], edgecolor='white', linewidth=2)
axes[1, 1].set_ylabel('Average Charges ($)', fontsize=11)
axes[1, 1].set_title('Healthcare Costs by Sex', fontsize=12, fontweight='bold')
for bar, val in zip(bars, sex_charges.values):
    axes[1, 1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 200,
                    f'${val:,.0f}', ha='center', va='bottom', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()
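
The age groups above come from `pd.cut`, whose bins include their upper edge by default. The first interval is therefore (0, 25], and an age of exactly 65 lands in `56-65` rather than `65+`. A small sketch on made-up ages (not project data) shows where the boundaries fall and how an empty group can be detected:

```python
import pandas as pd

# Made-up ages chosen to sit on the bin edges; not taken from the project data.
ages = pd.Series([18, 25, 26, 45, 65, 66])

age_group = pd.cut(
    ages,
    bins=[0, 25, 35, 45, 55, 65, 100],
    labels=['18-25', '26-35', '36-45', '46-55', '56-65', '65+'],
)  # right=True by default: each bin includes its upper edge

# Map each age to the label it was assigned.
boundary_check = dict(zip(ages.tolist(), age_group.astype(str).tolist()))
print(boundary_check)

# Labels that received no rows at all.
observed_labels = set(age_group.astype(str))
empty_groups = [label for label in age_group.cat.categories
                if label not in observed_labels]
print("Empty groups:", empty_groups)
```

If the training data contains no one older than 65, the `65+` group is empty in the same way, and its mean is undefined.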

### 3.3 Correlation Analysis

In [None]:
# Create correlation heatmap
df_corr = df.copy()
df_corr['sex_encoded'] = df_corr['sex'].map({'male': 1, 'female': 0})
df_corr['smoker_encoded'] = df_corr['smoker'].map({'yes': 1, 'no': 0})

correlation_cols = ['age', 'bmi', 'children', 'sex_encoded', 'smoker_encoded', 'charges']
correlation_matrix = df_corr[correlation_cols].corr()

fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt='.3f', cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8}, ax=ax)
ax.set_title('Feature Correlation Matrix', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

# Display correlation with charges
print("="*60)
print("CORRELATION WITH CHARGES")
print("="*60)
charges_corr = correlation_matrix['charges'].sort_values(ascending=False)
for feature, corr in charges_corr.items():
    if feature != 'charges':
        print(f"{feature:20s}: {corr:+.4f}")
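
The heatmap relies on `DataFrame.corr()`, which computes Pearson correlation by default. For a 0/1 column such as `smoker_encoded`, this is the point-biserial correlation. The sketch below reproduces one coefficient by hand on a tiny made-up frame, so a number in the matrix can be traced back to its formula. The values are illustrative only.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; not project data.
toy = pd.DataFrame({
    'smoker': ['yes', 'no', 'no', 'yes', 'no', 'no'],
    'charges': [30000.0, 4000.0, 6500.0, 41000.0, 3000.0, 9000.0],
})
toy['smoker_encoded'] = toy['smoker'].map({'yes': 1, 'no': 0})

x = toy['smoker_encoded'].to_numpy(dtype=float)
y = toy['charges'].to_numpy(dtype=float)

# Pearson r = sum of centered products / sqrt(product of centered sums of squares)
x_c, y_c = x - x.mean(), y - y.mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# Same quantity via pandas, for comparison.
r_pandas = toy['smoker_encoded'].corr(toy['charges'])
print(f"manual: {r_manual:+.4f}   pandas: {r_pandas:+.4f}")
```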

### 3.4 Advanced Scatter Plots

In [None]:
# BMI vs Charges and Age vs Charges colored by smoker status
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

colors_scatter = df['smoker'].map({'yes': '#e74c3c', 'no': '#27ae60'})

# BMI vs Charges
axes[0].scatter(df['bmi'], df['charges'], c=colors_scatter, alpha=0.5, s=50, edgecolors='white', linewidth=0.5)
axes[0].set_xlabel('BMI', fontsize=12)
axes[0].set_ylabel('Charges ($)', fontsize=12)
axes[0].set_title('BMI vs Charges (Red=Smoker, Green=Non-Smoker)', fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Age vs Charges
axes[1].scatter(df['age'], df['charges'], c=colors_scatter, alpha=0.5, s=50, edgecolors='white', linewidth=0.5)
axes[1].set_xlabel('Age', fontsize=12)
axes[1].set_ylabel('Charges ($)', fontsize=12)
axes[1].set_title('Age vs Charges (Red=Smoker, Green=Non-Smoker)', fontsize=13, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 4. Model Performance Analysis

Examine the model's performance in detail.

In [None]:
# Display model metrics
print("="*60)
print("MODEL PERFORMANCE METRICS")
print("="*60)
print(f"\nR² Score:        {metrics['r2']:.4f}")
print(f"RMSE:            ${metrics['rmse']:,.2f}")
print(f"MAE:             ${metrics['mae']:,.2f}")
print(f"CV Mean:         {metrics['cv_mean']:.4f}")
print(f"CV Std:          {metrics['cv_std']:.4f}")

# Interpretation
print("\n" + "="*60)
print("MODEL INTERPRETATION")
print("="*60)
print(f"\nThe model explains {metrics['r2']*100:.2f}% of the variance in healthcare costs.")
print(f"On average, predictions are off by ${metrics['mae']:,.2f}.")

if metrics['r2'] > 0.85:
    print("\n✅ Model Performance: EXCELLENT")
    print("   The model is highly reliable for cost predictions.")
elif metrics['r2'] > 0.75:
    print("\n✅ Model Performance: GOOD")
    print("   The model performs well and can be used for predictions.")
else:
    print("\n⚠️  Model Performance: MODERATE")
    print("   Consider additional features or model tuning.")
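
For reference, the three error metrics printed above have simple closed forms. The sketch below computes them with NumPy on made-up values. It assumes that `metrics['r2']`, `metrics['rmse']`, and `metrics['mae']` follow the standard definitions, which this notebook does not confirm. `regression_metrics` is a hypothetical helper.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return R², RMSE, and MAE using their standard definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return {
        'r2': float(1.0 - ss_res / ss_tot),
        'rmse': float(np.sqrt(np.mean(residuals ** 2))),
        'mae': float(np.mean(np.abs(residuals))),
    }


# Made-up charges, for illustration only.
example = regression_metrics([1000, 2000, 3000, 4000], [1100, 1900, 3200, 3800])
print(example)
```

RMSE is never smaller than MAE. A large gap between the two suggests that a few predictions are far off, even if the typical error is modest.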

## 5. Validation Predictions Analysis

In [None]:
# Examine validation predictions
print("="*60)
print("VALIDATION PREDICTIONS SUMMARY")
print("="*60)
print(f"\nTotal Predictions:  {len(predictions)}")
print(f"Minimum:            ${predictions.min():,.2f}")
print(f"Maximum:            ${predictions.max():,.2f}")
print(f"Mean:               ${predictions.mean():,.2f}")
print(f"Median:             ${np.median(predictions):,.2f}")
print(f"Standard Deviation: ${predictions.std():,.2f}")

# Load the predictions CSV to display
predictions_df = pd.read_csv('validation_predictions.csv')
print("\n" + "="*60)
print("SAMPLE PREDICTIONS")
print("="*60)
display(predictions_df.head(15))

### 5.1 Validation Predictions Visualization

In [None]:
# Visualize validation predictions
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Validation Predictions Analysis', fontsize=16, fontweight='bold')

# 1. Predictions distribution
axes[0, 0].hist(predictions, bins=40, color='#27ae60', edgecolor='white', alpha=0.7)
axes[0, 0].axvline(predictions.mean(), color='red', linestyle='--', linewidth=2, 
                   label=f'Mean: ${predictions.mean():,.0f}')
axes[0, 0].set_xlabel('Predicted Charges ($)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distribution of Predicted Charges')
axes[0, 0].legend()

# 2. By smoker status
val_output = val_df.copy()
val_output['predicted_charges'] = predictions
smoker_pred = val_output.groupby('smoker')['predicted_charges'].mean().sort_values(ascending=False)
bars = axes[0, 1].bar(smoker_pred.index, smoker_pred.values, color=['#e74c3c', '#27ae60'], edgecolor='white', linewidth=2)
axes[0, 1].set_ylabel('Mean Predicted Charges ($)')
axes[0, 1].set_title('Predicted Charges by Smoking Status')
for bar, val in zip(bars, smoker_pred.values):
    axes[0, 1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 200,
                    f'${val:,.0f}', ha='center', va='bottom', fontsize=10, fontweight='bold')

# 3. Age distribution comparison
axes[1, 0].hist(df['age'], bins=25, alpha=0.6, label='Training', color='#2E86AB', edgecolor='white')
axes[1, 0].hist(val_df['age'], bins=25, alpha=0.6, label='Validation', color='#F18F01', edgecolor='white')
axes[1, 0].set_xlabel('Age')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Age Distribution: Training vs Validation')
axes[1, 0].legend()

# 4. Predicted charges by region
region_pred = val_output.groupby('region')['predicted_charges'].mean().sort_values(ascending=False)
bars = axes[1, 1].bar(region_pred.index, region_pred.values, color=['#3498db', '#e74c3c', '#2ecc71', '#f39c12'][:len(region_pred)], edgecolor='white', linewidth=2)
axes[1, 1].set_ylabel('Mean Predicted Charges ($)')
axes[1, 1].set_title('Predicted Charges by Region')
for bar, val in zip(bars, region_pred.values):
    axes[1, 1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 200,
                    f'${val:,.0f}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

## 6. Custom Predictions

Make predictions for custom scenarios.

In [None]:
# Retrain a model on the training data already in memory (df) so that the
# fitted model and its feature list are available for custom predictions

print("Training a fresh model for custom predictions...")
model, model_metrics, X_test, y_test, y_pred, feature_list = train_model(df)

print("\n✓ Model trained successfully!")
print(f"   R² Score: {model_metrics['r2']:.4f}")
print(f"   RMSE: ${model_metrics['rmse']:,.2f}")

In [None]:
# Create custom prediction scenarios
# Example: Compare costs for different scenarios

scenarios = pd.DataFrame([
    {'age': 25, 'sex': 'male', 'bmi': 22.0, 'children': 0, 'smoker': 'no', 'region': 'southwest'},
    {'age': 25, 'sex': 'male', 'bmi': 22.0, 'children': 0, 'smoker': 'yes', 'region': 'southwest'},
    {'age': 45, 'sex': 'female', 'bmi': 28.0, 'children': 2, 'smoker': 'no', 'region': 'northeast'},
    {'age': 45, 'sex': 'female', 'bmi': 28.0, 'children': 2, 'smoker': 'yes', 'region': 'northeast'},
    {'age': 60, 'sex': 'male', 'bmi': 32.0, 'children': 0, 'smoker': 'no', 'region': 'southeast'},
    {'age': 60, 'sex': 'male', 'bmi': 32.0, 'children': 0, 'smoker': 'yes', 'region': 'southeast'},
])

scenarios['scenario_name'] = [
    'Young, Healthy, Non-Smoker',
    'Young, Healthy, Smoker',
    'Middle-aged, Family, Non-Smoker',
    'Middle-aged, Family, Smoker',
    'Senior, Overweight, Non-Smoker',
    'Senior, Overweight, Smoker'
]

# Make predictions for scenarios
scenario_processed, scenario_predictions = predict_validation(model, scenarios, feature_list)

scenarios['predicted_cost'] = scenario_predictions

print("="*80)
print("COST COMPARISON ACROSS SCENARIOS")
print("="*80)
display(scenarios[['scenario_name', 'age', 'bmi', 'smoker', 'predicted_cost']].style.format({
    'predicted_cost': '${:,.2f}'
}))

# Visualize scenario comparisons
fig, ax = plt.subplots(figsize=(12, 6))
colors_scenario = ['#27ae60' if 'Non-Smoker' in name else '#e74c3c' for name in scenarios['scenario_name']]
bars = ax.bar(range(len(scenarios)), scenarios['predicted_cost'], color=colors_scenario, edgecolor='white', linewidth=2)
ax.set_xticks(range(len(scenarios)))
ax.set_xticklabels(scenarios['scenario_name'], rotation=45, ha='right')
ax.set_ylabel('Predicted Annual Cost ($)', fontsize=12)
ax.set_title('Healthcare Cost Comparison Across Different Scenarios', fontsize=14, fontweight='bold')

for bar, val in zip(bars, scenarios['predicted_cost']):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 500,
            f'${val:,.0f}', ha='center', va='bottom', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()
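
The six scenarios are three matched pairs that differ only in `smoker`, so the gap within each pair is the model's implied smoking premium for that profile. A sketch of that comparison follows. The `predicted_cost` values are placeholders invented for illustration, not output from the model, and `profile` is a hypothetical column name.

```python
import pandas as pd

# Placeholder costs for illustration only; substitute the real `scenarios` frame.
demo = pd.DataFrame({
    'profile': ['Young', 'Young', 'Middle-aged', 'Middle-aged', 'Senior', 'Senior'],
    'smoker': ['no', 'yes', 'no', 'yes', 'no', 'yes'],
    'predicted_cost': [3000.0, 18000.0, 8000.0, 26000.0, 14000.0, 40000.0],
})

# One row per profile, one column per smoker status.
paired = demo.pivot(index='profile', columns='smoker', values='predicted_cost')
paired['premium'] = paired['yes'] - paired['no']   # absolute gap in dollars
paired['ratio'] = paired['yes'] / paired['no']     # relative gap
print(paired.round(2))
```

Reporting both the absolute and the relative gap is useful because a premium that grows in dollars with age can still shrink as a multiple of the non-smoker cost.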

## 7. Key Insights Summary

In [None]:
# Generate comprehensive insights
print("="*70)
print("KEY INSIGHTS FROM HEALTHCARE COST ANALYSIS")
print("="*70)

# Smoking Impact
smoker_yes_avg = df[df['smoker']=='yes']['charges'].mean()
smoker_no_avg = df[df['smoker']=='no']['charges'].mean()
smoker_multiplier = smoker_yes_avg / smoker_no_avg

print("\n🚬 SMOKING IMPACT:")
print("-" * 70)
print(f"   • Smokers pay {smoker_multiplier:.2f}x what non-smokers pay")
print(f"   • Average smoker cost: ${smoker_yes_avg:,.2f}")
print(f"   • Average non-smoker cost: ${smoker_no_avg:,.2f}")
print(f"   • Cost difference: ${smoker_yes_avg - smoker_no_avg:,.2f}")

# Age Impact
df_age = df.copy()
df_age['age_group'] = pd.cut(df_age['age'], bins=[0, 25, 35, 45, 55, 65, 100],
                              labels=['18-25', '26-35', '36-45', '46-55', '56-65', '65+'])
# observed=True drops empty age bins so the youngest/oldest rows below are never NaN
age_charges = df_age.groupby('age_group', observed=True)['charges'].mean()

print("\n👤 AGE IMPACT:")
print("-" * 70)
print(f"   • Youngest ({age_charges.index[0]}): ${age_charges.iloc[0]:,.2f}")
print(f"   • Oldest ({age_charges.index[-1]}): ${age_charges.iloc[-1]:,.2f}")
print(f"   • Increase with age: ${age_charges.iloc[-1] - age_charges.iloc[0]:,.2f}")

# BMI Impact
print("\n⚖️  BMI IMPACT:")
print("-" * 70)
df_bmi = df.copy()
df_bmi['bmi_cat'] = pd.cut(df_bmi['bmi'], bins=[0, 25, 30, 100],
                           labels=['Normal/Underweight', 'Overweight', 'Obese'])
# observed=True skips BMI categories that have no rows
bmi_charges = df_bmi.groupby('bmi_cat', observed=True)['charges'].mean()
for cat, charge in bmi_charges.items():
    print(f"   • {cat}: ${charge:,.2f}")

print("\n💡 TOP RECOMMENDATIONS:")
print("-" * 70)
print("   1. Prioritize smoking cessation programs (biggest cost driver)")
print("   2. Implement age-based preventive care strategies")
print("   3. Promote healthy BMI through wellness programs")
print("   4. Use predictive model for personalized pricing")
print("   5. Focus interventions on high-risk groups")

print("\n" + "="*70)

## 8. Export & Next Steps

In [None]:
print("="*60)
print("ANALYSIS OUTPUTS")
print("="*60)
print("\n✓ Files Generated:")
print("   1. healthcare_dashboard.png - Comprehensive visual dashboard")
print("   2. validation_predictions.csv - All validation predictions")
print("\n✓ Available in Memory:")
print("   - df: Training dataset")
print("   - val_df: Validation dataset")
print("   - model: Trained prediction model (after running Section 6)")
print("   - predictions: Validation predictions array")
print("   - metrics: Model performance metrics")

print("\n" + "="*60)
print("NEXT STEPS")
print("="*60)
print("""
1. Review the comprehensive dashboard (healthcare_dashboard.png)
2. Examine validation predictions (validation_predictions.csv)
3. Use the model for custom predictions
4. Implement business recommendations
5. Monitor model performance over time
6. Retrain with new data as needed
""")

print("="*60)
print("✓ Analysis Complete! 🎉")
print("="*60)

---

## Additional Notes

### Reusable Functions from healthcare_analysis.py:

- `load_and_clean_data()` - Load and clean training data
- `load_validation_data()` - Load and clean validation data
- `prepare_features(df)` - Engineer features for a dataset
- `train_model(df)` - Train the prediction model
- `predict_validation(model, df, features)` - Make predictions
- `create_dashboard(...)` - Generate comprehensive dashboard
- `main()` - Run complete pipeline

### Tips:
- All functions are imported and ready to use
- No need to rewrite data loading or model training code
- Focus on exploration, visualization, and insights
- Easily extend with custom analysis

---
*Healthcare Cost Prediction - Interactive Analysis Notebook*