# 🩺 Insurance Cost Analysis: Profit Logic & Social Vulnerability

This portfolio project explores how medical insurance pricing in the U.S. reflects profit-driven logic by disproportionately affecting socially vulnerable groups such as women, older individuals, and smokers.

We use Python to analyze real-world data and reveal how certain features influence charges and reinforce bias. The project concludes with ethical insights into how algorithmic pricing can reproduce structural inequalities.


In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv("insurance.csv")
df.head()
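
Before running the analysis, it can help to confirm that the file has the columns the later cells read by name (`age`, `sex`, `smoker`, `charges`). The following is a minimal sketch, not part of the original notebook: the helper `check_columns` is hypothetical, and the inline rows are made-up stand-ins for `insurance.csv`.

```python
import io

import pandas as pd

# Columns the analysis class below reads by name.
REQUIRED_COLUMNS = {"age", "sex", "smoker", "charges"}


def check_columns(frame: pd.DataFrame) -> set:
    """Return the required columns that are missing from the frame."""
    return REQUIRED_COLUMNS - set(frame.columns)


# Made-up rows standing in for insurance.csv (illustrative values only).
sample_csv = io.StringIO(
    "age,sex,smoker,charges\n"
    "19,female,yes,16000.0\n"
    "45,male,no,8000.0\n"
)
sample = pd.read_csv(sample_csv)
missing = check_columns(sample)
print("Missing columns:", sorted(missing))
```

On the real data, `check_columns(df)` should return an empty set before the class is instantiated.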

In [None]:

import json

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

class FixedInsuranceAnalysis:
    def __init__(self, csv_path):
        self.df = pd.read_csv(csv_path)

    def summarize_data(self):
        return self.df.describe(include="all")

    def charges_by_gender(self):
        return self.df.groupby("sex")["charges"].agg(["mean", "median", "count"]).round(2)

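    def charges_by_age_bins(self, bins=(18, 30, 45, 60, 65)):
        # include_lowest=True keeps 18-year-olds, who would otherwise fall outside the
        # first right-closed bin (18, 30]. The groups live in a local Series so that no
        # "age_group" column is added to self.df and later dummy-encoded by the regression.
        age_group = pd.cut(self.df["age"], bins=list(bins), include_lowest=True).rename("age_group")
        result = self.df.groupby(age_group, observed=False)["charges"].agg(["mean", "count"]).round(2)
        result.index = result.index.astype(str)
        return result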

    def charges_by_smoking_status(self):
        return self.df.groupby("smoker")["charges"].agg(["mean", "median", "count"]).round(2)

    def visualize_charges_by_category(self, category):
        if category in self.df.columns:
            plt.figure(figsize=(8, 5))
            sns.boxplot(data=self.df, x=category, y="charges", palette="Set3")
            plt.title(f"Insurance Charges by {category.capitalize()}")
            plt.ylabel("Charges ($)")
            plt.tight_layout()
            plt.show()
        else:
            print(f"'{category}' is not a valid column.")

    def correlation_matrix(self):
        return self.df.corr(numeric_only=True)

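    def regression_inputs(self):
        # One-hot encode the categorical columns; drop_first removes one redundant dummy per feature.
        df_encoded = pd.get_dummies(self.df, drop_first=True, dtype=float)
        X = df_encoded.drop("charges", axis=1)
        y = df_encoded["charges"]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        model = LinearRegression()
        model.fit(X_train, y_train)
        # R² is measured on the held-out 20% of rows.
        r2 = model.score(X_test, y_test)
        return {
            "r2_score": round(float(r2), 3),
            # Plain Python floats keep the result straightforward to serialize as JSON.
            "coefficients": {name: float(coef) for name, coef in zip(X.columns, model.coef_)}
        }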

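    def top_influential_factors(self, top_n=5):
        # Ranks features by absolute raw coefficient. The features are not standardized,
        # so a 0/1 dummy and a per-year age coefficient are on different scales.
        model_results = self.regression_inputs()
        coef_dict = model_results["coefficients"]
        sorted_coef = sorted(coef_dict.items(), key=lambda item: abs(item[1]), reverse=True)
        return sorted_coef[:top_n]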

    def discuss_bias(self):
        notes = [
            "⚠️ The dataset reflects binary gender (male/female), ignoring non-binary identities.",
            "⚠️ Charges for smokers are far higher than for non-smokers, a gap that may reflect profit-driven risk pricing rather than actual cost alone.",
            "⚠️ Charges rise with age, potentially penalizing individuals as they become less economically productive.",
            "⚠️ Regional pricing may reflect systemic inequalities in healthcare access and provider markup.",
            "⚠️ Predictive models based on these features may reinforce existing social biases in automated insurance decisions.",
        ]
        return notes

    def summarize_findings(self):
        return {
            "gender_bias": self.charges_by_gender().to_dict(),
            "age_groups": self.charges_by_age_bins().to_dict(),
            "smoker_markup": self.charges_by_smoking_status().to_dict(),
            "correlation_matrix": self.correlation_matrix().to_dict(),
            "regression_summary": self.regression_inputs()
        }

    def export_report(self, filename="insurance_analysis_report.json"):
        report = {
            "summary": self.summarize_findings(),
            "top_factors": self.top_influential_factors(),
            "bias_notes": self.discuss_bias()
        }
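        with open(filename, "w", encoding="utf-8") as f:
            # ensure_ascii=False keeps the emoji in the bias notes readable in the file.
            json.dump(report, f, indent=2, ensure_ascii=False)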
        return f"Report saved to {filename}"
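
The regression in `regression_inputs` relies on `pd.get_dummies(..., drop_first=True)`, which turns each categorical column into 0/1 indicators and drops the first category as the baseline. Each coefficient is then read relative to that baseline. A small sketch with a made-up frame (the rows are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "age": [19, 45, 60],
    "sex": ["female", "male", "female"],
    "smoker": ["yes", "no", "no"],
})

# Categories are sorted, so "female" and "no" become the dropped baselines.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded.columns.tolist())  # ['age', 'sex_male', 'smoker_yes']
```

A coefficient on `smoker_yes` is therefore the estimated difference from non-smokers, holding the other features fixed.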


In [None]:
analysis = FixedInsuranceAnalysis("insurance.csv")

In [None]:
# 📊 Gender-Based Insurance Charges
analysis.charges_by_gender()
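
A table of group means does not say whether a gap is larger than random variation would produce. One way to check is a permutation test, sketched below with NumPy only. The function name `permutation_p_value` and the sample values are assumptions made for illustration; this is not a step the original analysis performs.

```python
import numpy as np


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values and split them into two pseudo-groups.
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:len(a)].mean() - shuffled[len(a):].mean())
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Made-up charges for two groups (illustrative only).
p_gap = permutation_p_value([1000, 1200, 1100, 1300, 1250],
                            [5000, 5200, 5100, 5300, 5250])
p_same = permutation_p_value([1, 2, 3], [1, 2, 3])
print(p_gap, p_same)
```

On the real data, the two inputs would be the `charges` values for each `sex` group taken from `analysis.df`.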

In [None]:
# 📊 Age-Based Insurance Charges
analysis.charges_by_age_bins()
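
`pd.cut` builds right-closed intervals by default, so a value equal to the lowest bin edge is left out unless `include_lowest=True` is passed. This matters here because the first bin edge is 18. A minimal sketch with made-up ages showing the difference:

```python
import pandas as pd

ages = pd.Series([18, 19, 30, 31, 64])  # made-up ages
bins = [18, 30, 45, 60, 65]

# Default: intervals are (18, 30], (30, 45], ... so age 18 matches no bin.
default_cut = pd.cut(ages, bins)
# include_lowest=True closes the first interval on the left as well.
inclusive_cut = pd.cut(ages, bins, include_lowest=True)

print(default_cut.isna().tolist())    # [True, False, False, False, False]
print(inclusive_cut.isna().tolist())  # [False, False, False, False, False]
```

Rows that receive no bin are silently excluded from a `groupby`, so the per-group counts would not add up to the full sample.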

In [None]:
# 📊 Smoker vs Non-Smoker Charges
analysis.charges_by_smoking_status()
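
The size of the smoker gap is often easier to read as a ratio of group means than as two separate numbers. A short sketch, using made-up charges rather than the dataset's values:

```python
import pandas as pd

# Made-up charges (illustrative only).
toy = pd.DataFrame({
    "smoker": ["yes", "yes", "no", "no", "no"],
    "charges": [30000.0, 34000.0, 8000.0, 9000.0, 7000.0],
})

means = toy.groupby("smoker")["charges"].mean()
ratio = means["yes"] / means["no"]
print(f"Smokers are charged {ratio:.1f}x the non-smoker mean")  # 4.0x for these toy values
```

The same two lines applied to `analysis.df` would give the ratio for the real data.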

In [None]:
# 📈 Correlation Matrix
analysis.correlation_matrix()
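
Because `corr(numeric_only=True)` skips text columns, `sex`, `smoker`, and any other categorical feature do not appear in the matrix. One option, sketched here under the assumption that a 0/1 flag is an acceptable encoding, is to add an indicator column first. The column name `smoker_flag` and the rows are made up for illustration.

```python
import pandas as pd

# Made-up rows (illustrative only).
toy = pd.DataFrame({
    "age": [19, 33, 45, 52, 60, 28],
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "charges": [17000.0, 4500.0, 8200.0, 40000.0, 13000.0, 3900.0],
})

# The text column "smoker" is dropped from the matrix.
plain = toy.corr(numeric_only=True)

# Adding a 0/1 indicator makes smoking status part of the matrix.
with_flag = toy.assign(
    smoker_flag=(toy["smoker"] == "yes").astype(int)
).corr(numeric_only=True)

print(plain.columns.tolist())      # ['age', 'charges']
print(with_flag.columns.tolist())  # ['age', 'charges', 'smoker_flag']
```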

In [None]:
# 🔍 Most Influential Factors (Regression)
analysis.top_influential_factors()
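
Ranking by raw coefficient size compares features measured on different scales: a one-unit change in a 0/1 smoker dummy is a much bigger step than one year of age. Standardizing the features before fitting puts the coefficients on a common "per standard deviation" footing. The sketch below uses NumPy's least-squares solver in place of scikit-learn's `LinearRegression`, and the data are simulated from a made-up relationship, so the numbers say nothing about the real dataset.

```python
import numpy as np


def ols_coefficients(X, y):
    """Least-squares slopes; an intercept is fitted but not returned."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]


rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=200).astype(float)
smoker = rng.integers(0, 2, size=200).astype(float)
# Made-up relationship used only to generate example data.
charges = 250.0 * age + 20000.0 * smoker + rng.normal(0, 1000, size=200)

X = np.column_stack([age, smoker])
raw = ols_coefficients(X, charges)

# Rescale each feature to zero mean and unit standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
standardized = ols_coefficients(X_std, charges)

print("raw:", raw)
print("standardized:", standardized)
```

In this simulation the smoker coefficient dwarfs the age coefficient in raw units, while the standardized coefficients are much closer together.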

In [None]:
# ⚠️ Bias Discussion
for note in analysis.discuss_bias():
    print(note)

## ✅ Conclusion

This analysis suggests that medical insurance pricing reflects more than health risk: it also monetizes **social vulnerabilities** such as:

- Gender (despite wage inequality)
- Age (despite lower income capacity with aging)
- Smoking (used as a proxy for socioeconomic status)

**Algorithmic systems trained on such biased data can reinforce systemic inequality**, making it essential to audit how these models are built and used.
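
One simple form such an audit could take is comparing a model's average prediction error across groups. The sketch below is an assumption about what that check might look like, not part of the original analysis: the helper `mean_error_by_group`, the `predicted` column, and all values are made up.

```python
import pandas as pd


def mean_error_by_group(frame, group_col, actual_col, predicted_col):
    """Average (predicted - actual) per group; positive means over-prediction."""
    errors = frame[predicted_col] - frame[actual_col]
    return errors.groupby(frame[group_col]).mean()


# Made-up actual and predicted charges (illustrative only).
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male"],
    "charges": [10000.0, 12000.0, 10000.0, 12000.0],
    "predicted": [11000.0, 13000.0, 9500.0, 11500.0],
})

audit = mean_error_by_group(toy, "sex", "charges", "predicted")
print(audit)
```

A model that systematically over-predicts for one group and under-predicts for another, as in these toy values, would be a signal worth investigating further.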

---

### 💾 Files Included:
- `Insurance_Profit_Logic_Analysis.ipynb`: This notebook
- `insurance.csv`: Dataset used for analysis
- `insurance_analysis_report.json`: Structured output written by `export_report()` (optional)
