## Summary Stats

##### 📊 Continuous Variables

**Age**

* Average age is around 42, with a balanced spread (18–66) and no strong skew.
* This indicates a representative adult population with both younger and older insured individuals.
* Since premiums typically rise with age, this wide age range lets the data capture how pricing changes across adulthood, making age a useful predictor for risk segmentation.

**Height**

* Heights center around 168 cm with little skew, spanning from 145 cm to 188 cm.
* Height on its own is less meaningful, but in combination with weight it helps calculate BMI.
* BMI can reveal obesity risks, which directly influence health outcomes and hence premiums.

**Weight**

* Average weight is \~77 kg, moderately right-skewed, with some heavier outliers up to 132 kg.
* This suggests a portion of the population is overweight/obese, which may correlate with lifestyle diseases.
* Higher weight relative to height (BMI) should be treated as a health risk driver in premium pricing.
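
As a minimal sketch of how height and weight combine into BMI, the function below applies the standard formula (weight in kilograms divided by height in metres squared). The function name `bmi` is an illustrative choice, and the inputs are the approximate central values quoted above, not a row from the dataset.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


# Illustration only: the dataset's approximate average weight and central height.
# The BMI of the averages is not the same thing as the average BMI.
example_bmi = bmi(77, 168)
print(round(example_bmi, 1))  # 27.3
```

A value of about 27 falls in the overweight band under the usual adult cut-offs, which is consistent with the right skew noted for weight.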

**Number of Major Surgeries**

* Most people have 0–1 surgeries and a few report up to 3, so the distribution is right-skewed.
* This distribution reflects that surgeries are relatively rare but highly impactful.
* Prior surgeries indicate higher medical risk, so this variable is likely to carry meaningful weight in premium adjustments.

**Premium Price**

* Average premium is \~₹24,300, ranging from ₹15,000 to ₹40,000, with slight right skew.
* This shows a structured pricing band with only a handful of high-end outliers (the IQR rule flags six premiums above ₹38.5k), consistent with standard underwriting rules.
* Premium is sensitive to medical history and demographics, so this distribution provides the baseline cost for prediction.
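
The figures in this section (mean, range, direction of skew) can be reproduced with a few lines of code. The sketch below uses only the standard library and a moment-based skewness; the helper name `summarize` and the sample values are hypothetical. Note that pandas' `Series.skew()` applies a bias correction, so its result may differ slightly from this one.

```python
import statistics


def summarize(values):
    """Return mean, min, max, and moment-based skewness for a numeric list."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    # Fisher-Pearson coefficient: average cubed z-score; > 0 means right skew.
    skew = sum(((x - mean) / std) ** 3 for x in values) / n
    return {"mean": mean, "min": min(values), "max": max(values), "skew": skew}


# Hypothetical premiums in rupees, not rows from the dataset.
sample_premiums = [15000, 21000, 23000, 28000, 40000]
stats = summarize(sample_premiums)
print(stats["mean"], stats["min"], stats["max"], stats["skew"] > 0)
```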

---

##### ⚖️ Binary Variables

**Diabetes**

* Around 42% of the insured population has diabetes.
* This is a significant portion, showing diabetes is common in the dataset.
* Since diabetes is a chronic condition, it adds substantial long-term cost, raising premium risk.

**Blood Pressure Problems**

* Nearly 47% report hypertension issues.
* This high prevalence indicates cardiovascular risks are widespread.
* Hypertension is a plausible driver of claims, which makes it an important candidate feature for premium modeling.

**Any Transplants**

* Only about 6% have undergone transplants.
* Rare but very high-risk cases.
* Even though the sample is small, transplant history justifies very high premium loadings.

**Any Chronic Diseases**

* About 18% report chronic diseases other than diabetes/BP.
* Reflects a smaller but costly subset of policyholders.
* Inclusion of this variable sharpens risk categorization for sustained medical expenses.

**Known Allergies**

* 21% report allergies.
* Allergies can range from minor to severe, but in aggregate they suggest higher medical sensitivity.
* May serve as a secondary risk factor, moderately influencing premium.

**History of Cancer in Family**

* Around 12% have family cancer history.
* This indicates potential genetic predisposition to high-cost illnesses.
* Even as a family history (not diagnosis), insurers may treat this as an early-warning risk flag.

## Distribution Analysis

#### 🔎 Insights

- Age is evenly spread across adulthood, with no extreme skew, indicating balanced representation of young and older individuals.

- Height follows a near-normal spread around 168 cm.

- Weight shows right skew, reflecting a notable overweight/obese subset.

- Premium prices cluster between ₹15k and ₹30k, with a few higher outliers, showing standard pricing with risk-based adjustments.

#### 🔎 Insights

* Age, height, and weight are fairly compact with few visible outliers.

* Premium prices show a wide range, with several outliers at the top end (the IQR rule places the upper bound at about ₹38.5k).

* This suggests that while demographics are stable, premiums vary more due to health conditions.

* Outliers in premium highlight customers with higher risk profiles driving up costs.



#### 🔎 Insights

* Diabetes (\~42%) and blood pressure problems (\~47%) each affect a large share of the population.

* Transplants are rare (\~6%) but critical for cost implications.

* Allergies (\~21%) and chronic diseases (\~18%) each affect roughly a fifth of the population, while family cancer history is at \~12%.

* These conditions add significant heterogeneity in health risk, influencing premium differentiation.

#### 🔎 Insights

* Nearly half have no surgeries, and \~38% report exactly one.

* Multiple surgeries (2–3) are rare but indicate high medical risk.

* This reflects that surgeries are infrequent events but carry heavy cost weight.

* The skew suggests the variable is important for identifying high-cost outliers.

#### 🔎 Insights

##### BMI Categories

* Distribution is well spread, with \~32% normal, 33% overweight, and 31% obese.

* Very few fall under the underweight group (4%).

* High overweight/obesity rates suggest lifestyle-related risks are common.

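* BMI is commonly associated with chronic conditions, but its direct correlation with premium in this dataset is weak (r ≈ 0.10), so it acts as a secondary rather than a major premium driver.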
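
The category shares above depend on where the BMI cut-offs sit. The report does not state them, so the sketch below assumes the common WHO adult thresholds (18.5, 25, 30); the function names and sample values are hypothetical.

```python
from collections import Counter


def bmi_category(bmi_value: float) -> str:
    """Map a BMI value to a band using the assumed WHO adult cut-offs."""
    if bmi_value < 18.5:
        return "Underweight"
    if bmi_value < 25.0:
        return "Normal"
    if bmi_value < 30.0:
        return "Overweight"
    return "Obese"


def category_shares(bmi_values):
    """Percentage of records in each BMI band."""
    counts = Counter(bmi_category(v) for v in bmi_values)
    total = len(bmi_values)
    return {band: 100.0 * count / total for band, count in counts.items()}


# Hypothetical BMI values, used only to exercise the functions.
shares = category_shares([17.9, 22.4, 24.9, 27.3, 31.0, 36.2, 28.8, 21.1])
print(shares)
```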



##### Age Groups

* Largest share of insured falls in the 18–29 and 40–49 ranges (\~24% each).

* Older groups (50–59 and 60+) are individually smaller (\~32% combined).

* This suggests a balanced pool with both young, low-risk customers and older, high-risk customers.

* Because every age band is well represented, premiums can be adjusted by age band with reasonable support in each band.



#### 🔎 Insights

* Premiums rise steadily with age, starting from an average of \~₹16.4k in the 18–29 group to \~₹28.8k in the 60+ group.

* This shows a strong positive relationship between age and insurance cost, reflecting that older individuals are considered riskier due to higher likelihood of health issues.

* It highlights age as one of the most powerful predictors in premium estimation, making age-group-based pricing essential for accurate risk adjustment.
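
A minimal sketch of the age-band comparison follows. The 18–29, 40–49, 50–59, and 60+ bands come from this report; the 30–39 band is assumed to fill the gap, and the records are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def age_band(age: int) -> str:
    """Assign an age to a band (the 30-39 band is an assumption)."""
    if age < 30:
        return "18-29"
    if age < 40:
        return "30-39"
    if age < 50:
        return "40-49"
    if age < 60:
        return "50-59"
    return "60+"


def mean_premium_by_band(records):
    """records: iterable of (age, premium) pairs -> {band: mean premium}."""
    grouped = defaultdict(list)
    for age, premium in records:
        grouped[age_band(age)].append(premium)
    return {band: fmean(values) for band, values in grouped.items()}


# Hypothetical (age, premium) pairs, not taken from the dataset.
records = [(22, 15000), (27, 17000), (45, 23000), (48, 25000), (63, 29000)]
print(mean_premium_by_band(records))
```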



#### 🔎 Insights

* Individuals without chronic diseases pay lower average premiums (\~₹23.7k) compared to those with chronic conditions (\~₹27.1k).

* This indicates that chronic illnesses significantly increase perceived health risk, leading insurers to charge higher premiums.

* It reinforces the importance of medical history in premium pricing, as chronic conditions signal sustained long-term healthcare costs.

## Correlation Analysis


#### 🔎 Insights

* Most individuals with chronic diseases (156 out of 178) do not report a family cancer history, while only 22 report both risk factors together.

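* The 22 of 178 with both factors (about 12%) matches the overall family cancer history rate of about 12%, which suggests chronic conditions and family cancer history are largely independent, with limited overlap in the dataset.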

* From a premium perspective, the small group having both factors represents a concentrated high-risk segment that could justify significantly higher pricing adjustments.


#### 🔎 Insights

* Age shows the strongest correlation with premium (0.70), confirming it as the dominant driver of insurance pricing.

* Medical risk factors such as transplants (0.29), number of surgeries (0.26), and chronic diseases (0.21) also have notable positive relationships, indicating higher health burdens push premiums upward.

* Lifestyle and genetic indicators like weight, BMI, blood pressure, and family cancer history show weaker correlations, but still contribute incremental risk signals.

* Allergies and height have almost no correlation with premium, suggesting they are not meaningful for premium prediction.
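
For reference, here is a small sketch of how a correlation coefficient like those above can be computed. It assumes the reported values are Pearson coefficients, which the report does not state; the helper `pearson_r` and the data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values chosen for illustration only.
ages = [20, 30, 40, 50, 60]
premiums = [15000, 19000, 24000, 26000, 30000]
heights = [170, 160, 175, 165, 170]

print(round(pearson_r(ages, premiums), 3))     # strong positive relationship
print(round(pearson_r(heights, premiums), 3))  # much weaker relationship
```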

## Scatter Plots

#### 🔎 Insights

* Scatter plots show that Age has the strongest positive correlation with Premium Price.

  * Premiums steadily rise with age, highlighting its dominant role in pricing.
  * This aligns with insurers’ risk assessment, as older individuals typically face higher health risks.

* Height, Weight, and BMI display only weak relationships with Premium Price.

  * These attributes add minor variability but are not strong standalone predictors.
  * Their effect is overshadowed by direct medical risk indicators.

* The Number of Major Surgeries shows a moderate upward trend with Premium Price.

  * Customers with more surgeries generally face higher premiums.
  * This reinforces the importance of medical history in driving costs.

**Overall, Age and medical history are the strongest predictors of premiums, while body metrics contribute only marginally.**

## Boxplots

#### 🔎 Insights

* Health conditions such as **Diabetes, Blood Pressure Problems, Allergies, and Cancer History** show only modest differences in premium prices (4–9%).

  * These conditions raise premiums slightly but do not drastically alter pricing.
  * This indicates they are considered moderate risk factors in premium calculation.

* **Chronic Diseases** lead to a much higher premium impact (\~14.3%).

  * Customers with chronic illnesses consistently face elevated costs.
  * This reflects insurers’ recognition of ongoing medical expenses and higher long-term risks.

* **Organ Transplants** stand out with the largest effect (\~32%).

  * Premiums are significantly higher for customers with transplant history.
  * This condition represents a critical health risk, strongly influencing pricing decisions.

**Overall, severe and long-term health risks such as transplants and chronic diseases drive the largest premium differences, while moderate conditions show only incremental effects.**
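
The percentage effects quoted here can be read as relative differences in group averages. As a worked check, assuming the chronic disease figure is based on the mean premiums reported earlier (\~₹23.7k without and \~₹27.1k with chronic diseases), the arithmetic is:

```python
def percent_increase(baseline: float, value: float) -> float:
    """Relative increase of `value` over `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


# Average premiums (in thousands of rupees) reported earlier for policyholders
# without and with chronic diseases.
without_chronic = 23.7
with_chronic = 27.1

print(round(percent_increase(without_chronic, with_chronic), 1))  # 14.3
```

The result agrees with the \~14.3% stated for chronic diseases. Whether the other percentages use means or medians is not stated, so treat this as one plausible reading.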


## Pairplots

#### 🔎 Insights

* Individuals with diabetes show slightly higher BMI and weight clustering compared to non-diabetics.

  * This suggests a link between metabolic risk factors and diabetes occurrence.
  * Higher BMI-driven health risks can increase premium pricing, requiring insurers to account for obesity-related conditions.

* Blood pressure problems are more common in older age groups, but BMI and weight overlap heavily across both groups.

  * This indicates that age is a stronger differentiator than body composition for hypertension.
  * Premium adjustments for blood pressure should emphasize age-related risk rather than BMI alone.

* Transplants are rare but associated with very high premium clusters regardless of age or weight.

  * This highlights that the medical event itself, rather than lifestyle factors, drives premium costs.
  * Models should treat transplant history as a categorical high-risk driver, even if sample frequency is low.

* Chronic diseases are spread across all age groups but tend to align with higher premium bands.

  * This indicates a broad impact on insurance pricing irrespective of BMI or weight.
  * Chronic disease indicators should be prioritized as consistent premium escalators.

* Known allergies show minimal separation across age, BMI, or premium bands.

  * This suggests allergies alone may not significantly alter health risk or pricing.
  * They may only add predictive value when combined with other comorbidities.

* Family history of cancer shows moderate overlap with higher premiums but without strong clustering by BMI or weight.

  * This reflects genetic risk being independent of lifestyle variables.
  * Insurers may use family history as an additive but not dominant pricing factor.

---

**Overall, major health events like transplants and chronic diseases are the most consistent premium escalators, while lifestyle risks (BMI, weight) matter mainly through their association with diabetes; hypertension tracks age more than body composition.**




## Outlier Analysis

#### 🔎 Insights

* Age and height show no outliers, indicating a clean and consistent distribution across the dataset.

* Weight has 16 outliers above the upper bound, suggesting a subset of individuals with significantly higher body mass, potentially reflecting obesity risks.

* Number of major surgeries also has 16 outliers, highlighting a small group with unusually high surgical history, which signals elevated health risks.

* Premium price has 6 outliers above ₹38.5k, representing customers with extreme risk profiles where insurers impose substantially higher charges.
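
The upper bound mentioned above follows the usual IQR rule (Q3 + 1.5 × IQR). The sketch below assumes that rule with linear-interpolation quartiles. The sample is hypothetical and was chosen so that the upper fence lands at ₹38,500; the dataset's actual quartiles are not stated in this report.

```python
import statistics


def iqr_bounds(values, k: float = 1.5):
    """Return (lower, upper) fences using the Q1 - k*IQR / Q3 + k*IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_outliers(values, k: float = 1.5):
    """Values falling outside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical premiums (rupees), not rows from the dataset.
premiums = [15000, 19000, 21000, 23000, 24000, 25000, 28000, 30000, 40000]
print(iqr_bounds(premiums))    # (10500.0, 38500.0)
print(iqr_outliers(premiums))  # [40000]
```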



## Z-Score and IQR

#### 🔎 Insights

* Both methods agree that age and height are clean with no significant outliers, showing stable distributions.

* Weight and BMI emerge as the main sources of outliers. IQR flags more of them than Z-Score (22 vs 7 for BMI), which suggests the IQR rule is more sensitive to the heavy upper tail of the BMI distribution.

* Premium price shows a few high-end outliers (6 via IQR) but none flagged by Z-Score, indicating extreme premiums are still within statistical bounds when using standard deviation–based thresholds.

* Overall, the dataset has a very low percentage of outliers (<3%), all retained for modeling since they represent meaningful high-risk individuals rather than data errors.
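
To illustrate why Z-Score can miss points that IQR flags, the sketch below assumes the common |z| > 3 threshold (the report does not state the threshold used). The data are hypothetical.

```python
import statistics


def zscore_outliers(values, threshold: float = 3.0):
    """Values whose absolute Z-score exceeds `threshold` (assumed to be 3)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical premiums (rupees); the largest value is 15,000 above the mean.
premiums = [15000, 19000, 21000, 23000, 24000, 25000, 28000, 30000, 40000]
mean = statistics.fmean(premiums)
std = statistics.pstdev(premiums)

print(round((40000 - mean) / std, 2))  # 2.21
print(zscore_outliers(premiums))       # []
```

In this sample the quartiles are 21,000 and 28,000, so the 1.5 × IQR upper fence is 38,500 and the IQR rule would flag 40,000, while its Z-score stays below 3. Extreme values inflate the standard deviation itself, which makes the Z-Score rule more conservative on skewed data.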



## Hypothesis Testing

#### 🔎 Insights

##### 1️⃣ Feature → Premium Price (Direct Impact)

**Significant Predictors (Premium varies by these):**

* **Age** → 🔥 Strongest predictor (Spearman r=0.739, p<0.0001). Premiums rise sharply with age.
* **Medical Conditions:**

  * Diabetes (p=0.006)
  * Blood Pressure Problems (p<0.0001)
  * Any Transplants (p<0.0001)
  * Any Chronic Diseases (p<0.0001)
  * History of Cancer in Family (p=0.0001)
* **Surgeries:**

  * Number of Major Surgeries (Kruskal–Wallis p<0.0001) → Premium stratification across groups.
* **Physical Measures:**

  * Weight (r=0.129, p=0.00005)
  * BMI (r=0.098, p=0.0021)

**Non-Significant (No Premium Effect):**

* Known Allergies (p=0.566)
* Height (r=0.023, p=0.468)

📌 *Conclusion*: Premium pricing is **dominated by age and critical health conditions**, not by benign traits like height or allergies.
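
Spearman correlation is the Pearson correlation of the ranks, so it measures monotonic rather than strictly linear association. That may explain why the Spearman value for age (0.739) differs from the 0.70 reported in the correlation analysis, assuming the latter is a Pearson coefficient. A minimal sketch with hypothetical data and helper names:

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Hypothetical data: premiums rise with age, but not linearly.
ages = [20, 30, 40, 50, 60]
premiums = [15000, 15500, 16000, 25000, 40000]
print(round(spearman_rho(ages, premiums), 3))  # 1.0 (perfectly monotonic)
print(round(pearson_r(ages, premiums), 3))     # below 1 (not perfectly linear)
```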

---

##### 2️⃣ Feature ↔ Feature Dependencies (Chi-Square)

**Significant Dependence Detected (p < 0.05):**

* **Diabetes ↔ Blood Pressure Problems** (p<0.0001)

* **Diabetes ↔ Chronic Diseases** (p=0.006)

* **Diabetes ↔ Known Allergies** (p=0.015)

* **Diabetes ↔ Major Surgeries** (p<0.0001)

* **Blood Pressure Problems ↔ Major Surgeries** (p<0.0001)

* **Known Allergies ↔ History of Cancer in Family** (p<0.0001)

* **Known Allergies ↔ Major Surgeries** (p<0.0001)

* **History of Cancer in Family ↔ Major Surgeries** (p<0.0001)

**Independent (No Significant Link):**

* Any Transplants ↔ most features (no significant dependence).
* Allergies ↔ Chronic Diseases (p=0.447, not linked).
* Cancer History ↔ Chronic Diseases (p=0.886, not linked).

📌 *Conclusion*: Some health risks **cluster together** (e.g., diabetes, BP, and surgeries often co-occur). But **transplants stand alone** as a rare but impactful risk factor.
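
For two binary flags, the chi-square test of independence reduces to a 2×2 table. The sketch below computes the Pearson statistic and its p-value without a continuity correction. The counts are hypothetical, and the function name is illustrative. Library routines may differ: SciPy's `chi2_contingency`, for example, applies Yates' correction to 2×2 tables by default, as far as I am aware. Tests involving Number of Major Surgeries use more than two levels and are not covered by this sketch.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    Returns (statistic, p_value). No continuity correction is applied.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square tail equals a two-sided normal tail.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts: rows = condition A (no, yes), columns = condition B (no, yes).
dependent = [[30, 10], [10, 30]]
independent = [[20, 30], [40, 60]]

print(chi_square_2x2(dependent))    # large statistic, very small p-value
print(chi_square_2x2(independent))  # (0.0, 1.0)
```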

---

##### 3️⃣ Integrated Insights

* **Premium Drivers:** Age + chronic/critical health conditions (transplants, chronic diseases, BP, diabetes, family cancer history, surgeries).
* **Health Risk Clusters:** Diabetes, BP, and surgeries form a **high-risk group** with interdependencies.
* **Isolated Predictors:** Transplants have a **direct premium impact**, but no dependency with other conditions → treated as an independent red-flag by insurers.
* **Ignored Factors:** Height and allergies show no significant premium effect (allergies are also independent of chronic diseases).

## Modeling



#### 🔎📊 Final Model Comparison – Insights

| Model                                | Train R² | Test R²  | Test RMSE | Verdict                                                |
| ------------------------------------ | -------- | -------- | --------- | ------------------------------------------------------ |
| **Linear Regression**                | 0.70     | 0.79     | 3020      | Weak baseline, underfits                               |
| **Decision Tree (Tuned)**            | 0.83     | 0.88     | 2220      | Good, but prone to instability                         |
| **Random Forest (RandomSearch)**     | 0.82     | **0.89** | **2129**  | ✅ **Best Candidate (balance of accuracy & stability)** |
| **Gradient Boosting (RandomSearch)** | 0.85     | 0.88     | 2279      | Strong, slightly less stable than RF                   |
| **XGBoost (RandomSearch)**           | **1.00** | 0.86     | 2457      | Overfits badly (Train ≫ Test)                          |
| **LightGBM (RandomSearch)**          | 0.87     | 0.88     | 2239      | Excellent, very close to RF                            |
| **Neural Network (RandomSearch)**    | 0.70     | 0.78     | 3085      | Underperforms, not suitable for deployment             |
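
For readers less familiar with the two metrics in the table, here is a minimal sketch of their definitions. RMSE is in the target's units (rupees here), and R² is one minus the ratio of residual to total sum of squares. The function names and data are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical targets and predictions (rupees).
y_true = [15000, 23000, 25000, 30000, 38000]
y_pred = [17000, 22000, 26000, 29000, 35000]
print(round(rmse(y_true, y_pred)), round(r_squared(y_true, y_pred), 3))
```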

---

## 🔎 Key Takeaways

* **🏆 Best Performing Model → Random Forest (RandomSearch)**

  * Highest Test R² (**0.89**)
  * Lowest Test RMSE (**2129**)
  * Balanced Train vs Test → good generalization
  * Feature importances remain interpretable

* **Runner-up → LightGBM**

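  * Nearly identical performance (Test R² = 0.88, RMSE = 2239); the tuned Decision Tree has a marginally lower RMSE (2220) but ranks lower because single trees are prone to instability.
  * Slightly more efficient for very large data, but on this dataset Random Forest edges it out.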

* **Gradient Boosting** → reliable, but not as strong as RF/LightGBM.

* **XGBoost** → overfitting despite tuning (Train R² = 1.0). Risky for production.

* **Neural Network** → improved post-tuning, but still significantly weaker than tree ensembles.

* **Linear Regression** → serves as a baseline only.
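
The ranking can be checked mechanically from the comparison table. In the sketch below, the metric values are copied from that table, while the 0.10 overfitting gap is a hypothetical threshold chosen for illustration.

```python
# (train R², test R², test RMSE) copied from the model comparison table.
results = {
    "Linear Regression": (0.70, 0.79, 3020),
    "Decision Tree (Tuned)": (0.83, 0.88, 2220),
    "Random Forest (RandomSearch)": (0.82, 0.89, 2129),
    "Gradient Boosting (RandomSearch)": (0.85, 0.88, 2279),
    "XGBoost (RandomSearch)": (1.00, 0.86, 2457),
    "LightGBM (RandomSearch)": (0.87, 0.88, 2239),
    "Neural Network (RandomSearch)": (0.70, 0.78, 3085),
}

# Hypothetical threshold: flag a model when train R² exceeds test R² by more than 0.10.
OVERFIT_GAP = 0.10

ranked = sorted(results, key=lambda name: results[name][2])  # lowest RMSE first
overfit = [
    name for name, (train_r2, test_r2, _) in results.items()
    if train_r2 - test_r2 > OVERFIT_GAP
]

print(ranked[0])  # Random Forest (RandomSearch)
print(overfit)    # ['XGBoost (RandomSearch)']
```

Ranking by test RMSE alone ignores stability and interpretability, which is why the verdicts in the table weigh those factors as well.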

---

## ✅ Final Verdict

We **choose Random Forest (RandomSearch tuned)** as the **final model for deployment**, given its superior performance, robustness, and interpretability.
👉 LightGBM may also be considered as an alternative for scalability.
