These statistical analyses support insurance pricing by **quantifying risk differences** (ANOVA and t-tests), **measuring the size of cost impacts** (Cohen's d), and **validating pricing factors** (p-values). They enable data-driven premium adjustments, such as surcharges for high-risk groups (smokers, specific regions), while maintaining actuarial fairness. The result: **competitive yet profitable pricing** with transparent justification for customers and regulators.

## Key Steps

### Statistical Tests:

- ANOVA + Tukey HSD: Identify regional cost differences.

- T-test + Cohen’s d: Quantify smoker cost impact (practically significant!).

### Visualization:

- Use histograms for distributions, heatmaps for correlations.



# ANOVA for Regional Cost Analysis

## What is ANOVA?
**ANOVA (Analysis of Variance)** is a statistical method that compares means across three or more groups to determine if at least one group differs significantly from others.

### Key Hypotheses:
- **Null Hypothesis (H₀):** All group means are equal  
  *(Example: All regions have identical medical costs)*
- **Alternative Hypothesis (H₁):** At least one group mean differs  

---

## How to Use ANOVA for Regional Cost Analysis

### Step 1: Prepare the Data
Group medical charges by region:
```python
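# Assumes `region` has already been one-hot encoded into region_* columns
northeast = df[df['region_northeast'] == 1]['charges']
northwest = df[df['region_northwest'] == 1]['charges']
southeast = df[df['region_southeast'] == 1]['charges']
southwest = df[df['region_southwest'] == 1]['charges']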
```
### Step 2: Run One-Way ANOVA
```python
from scipy.stats import f_oneway
f_stat, p_value = f_oneway(northeast, northwest, southeast, southwest)
print(f"F-statistic: {f_stat:.2f}, p-value: {p_value:.4f}")
```
### Step 3: Interpret Results

| Metric       | Threshold | Conclusion                          |
|--------------|-----------|-------------------------------------|
| **p-value**  | < 0.05    | Significant regional differences     |
|              | ≥ 0.05    | No significant differences          |
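
To see the three steps end to end, here is a minimal sketch on synthetic data. The region means, the spread, and the sample sizes are invented for illustration and are not taken from the insurance dataset; only `f_oneway` is the same call used above.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Hypothetical charges per region: three regions share a mean, one is shifted up
region_means = {"northeast": 13000, "northwest": 13000,
                "southeast": 16000, "southwest": 13000}
samples = {name: rng.normal(loc=mu, scale=3000, size=200)
           for name, mu in region_means.items()}

# One-way ANOVA across all four groups
f_stat, p_value = f_oneway(*samples.values())
print(f"F-statistic: {f_stat:.2f}, p-value: {p_value:.4g}")

# Apply the decision rule from the table above
alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"At alpha = {alpha}: {decision}")
```

Because one group was deliberately shifted, the test should reject H₀ here. It does not say which region differs; that is the job of the post-hoc test.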

In [None]:
#3.1 ANOVA: Region Impact on Costs
from scipy.stats import f_oneway

# Todo: Group charges by region
regions = ['northeast', 'northwest', 'southeast', 'southwest']
# Todo: One-way ANOVA
print(f"ANOVA Results: F-statistic = {f_stat:.2f}, p-value = {p_value:.4f}")

# Interpretation
if p_value < 0.05:
    print("Significant regional differences exist (reject H0).")
else:
    print("No significant regional differences (fail to reject H0).")


# Post-Hoc Analysis in ANOVA

## What is a Post-Hoc Test?
A **post-hoc test** is performed after finding significant results in ANOVA (p < 0.05) to identify exactly which groups differ.

### Key Properties:
- 🔍 **Purpose**: Pinpoint specific significant differences between groups
- ⚖️ **Controls**: Family-wise error rate (reduces false positives)
- 📊 **Types**: Tukey HSD (most common), Bonferroni, Scheffé

---

# Tukey HSD: The Post-Hoc Test for ANOVA

## 1. Terminology Clarification
- **Tukey HSD** = **Tukey's Honestly Significant Difference** test
- **Turkey** = A country (no relation to statistics)
- **Tukey** = John Tukey, the statistician who developed this method

## 2. What is Tukey HSD?
A **post-hoc test** used after finding significant results in ANOVA to:
- Identify exactly which group pairs are different
- Control for Type I errors (false positives) when making multiple comparisons

## 3. How It Relates to Post-Hoc Analysis
| Concept        | Relationship to Tukey HSD |
|----------------|--------------------------|
| **Post-hoc**   | General term for follow-up tests after ANOVA |
| **Tukey HSD**  | One specific (and most popular) post-hoc method |

## 4. Key Features
```python
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Assumptions and usual workflow:
# - Typically run after a significant ANOVA result (p < 0.05)
# - Charges within each group are approximately normally distributed
# - Group variances are roughly equal
```
# Why Use Tukey HSD Instead of t-tests?

## The Multiple Comparisons Problem

When you have **3+ groups** (e.g., Region A, B, C, D), running individual t-tests between all pairs causes an **inflated Type I error rate** (false positives).

### Example with 4 Groups:
- **Number of pairs**: 6 (A-B, A-C, A-D, B-C, B-D, C-D)
- **Individual t-test error rate**: 5% per test
- **Overall error rate**: about 26% chance of ≥1 false positive  
  *(Calculated as `1 - (0.95)^6 ≈ 0.265`, assuming the six tests are independent)*
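
The same arithmetic generalizes to any number of groups. The sketch below is only an illustration: `familywise_error_rate` is a hypothetical helper name, and it treats the pairwise tests as independent, which is an approximation because the pairs share groups.

```python
from math import comb

def familywise_error_rate(n_groups: int, alpha: float = 0.05) -> tuple[int, float]:
    """Return (number of pairwise tests, chance of at least one false positive).

    Assumes the tests are independent, which pairwise comparisons that share
    groups are not, so treat the result as an approximation.
    """
    n_pairs = comb(n_groups, 2)
    return n_pairs, 1 - (1 - alpha) ** n_pairs

for k in (3, 4, 6):
    pairs, fwer = familywise_error_rate(k)
    print(f"{k} groups -> {pairs} pairs, family-wise error rate ~ {fwer:.3f}")
```

The error rate climbs quickly as groups are added, which is why an adjustment for multiple comparisons matters more the more regions are compared.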

## How Tukey HSD Solves This

| Feature               | t-tests               | Tukey HSD             |
|-----------------------|-----------------------|-----------------------|
| **Error Control**     | Per-test (5%)         | Family-wise (5%)      |
| **Adjustment**        | None                  | Corrects for multiple comparisons |
| **Power**             | Higher per-test       | Slightly lower        |
| **Best For**          | Comparing 2 groups    | Comparing all ANOVA group pairs |

### Key Advantage:
Tukey HSD maintains the **overall** Type I error rate at 5% across **all comparisons**, while six uncorrected t-tests would let it grow to roughly 26%.

## Step-by-Step: Tukey HSD Implementation

### 1. Data Preparation
```python
import pandas as pd

# One-hot encode regions (skip this line if df already has region_* columns)
df = pd.get_dummies(df, columns=['region'])

# Convert to long format for Tukey
df_melt = df.melt(
    id_vars=['charges'],
    value_vars=['region_northeast','region_northwest',
               'region_southeast','region_southwest'],
    var_name='region', 
    value_name='is_region'
)
df_melt = df_melt[df_melt['is_region'] == 1]
```
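
To make the reshaping concrete, the following sketch runs the same encode, melt, and filter sequence on a tiny made-up DataFrame. The four rows and the names `encoded` and `long_df` are illustrative assumptions, and `dtype=int` is passed only so the dummy columns hold 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical four-row dataset with a categorical region column
df = pd.DataFrame({
    "charges": [1200.0, 3400.0, 560.0, 7800.0],
    "region": ["northeast", "southeast", "northwest", "southwest"],
})

encoded = pd.get_dummies(df, columns=["region"], dtype=int)
region_cols = [c for c in encoded.columns if c.startswith("region_")]

# Melt pairs every charge with every region column
long_df = encoded.melt(id_vars=["charges"], value_vars=region_cols,
                       var_name="region", value_name="is_region")
print(len(long_df))  # 16 rows: 4 charges x 4 region columns

# Keep only the row where the flag is set, i.e. the person's actual region
long_df = long_df[long_df["is_region"] == 1].drop(columns="is_region")
print(long_df.sort_values("charges").to_string(index=False))
```

After filtering, there is exactly one row per original record. If the original categorical `region` column is still available, it can likely be used as the grouping variable directly, which makes the melt step unnecessary.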
### 2. Run Tukey Test
```python
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey = pairwise_tukeyhsd(
    endog=df_melt['charges'],  # Target variable (medical costs)
    groups=df_melt['region'],  # Grouping variable (regions)
    alpha=0.05                # Significance level
)
```
### 3. Interpret Results

#### Summary Output
```python
print(tukey.summary())
```

### Sample Output
```text
group1           group2         meandiff  p-adj   reject
---------------------------------------------------------
region_northeast region_southeast  1321.29  0.012    True
region_northeast region_northwest  -987.31  0.123   False
region_southeast region_southwest   588.98  0.045    True
```
## How to Read the Results

### meandiff
**Definition**: Difference in average costs between regions, computed as `group2` minus `group1`  

- **Positive value** (e.g., 1321.29): `group2` > `group1`  
- **Negative value** (e.g., -987.31): `group2` < `group1`  

### p-adj
**Definition**: Adjusted p-value accounting for multiple comparisons  

- **< 0.05**: Statistically significant difference (marked `True` in the reject column). If the two regions truly had equal mean costs, a gap this large would be expected less than 5% of the time.
- **≥ 0.05**: Not statistically significant (marked `False`). The observed gap is compatible with chance variation, so no conclusion about a difference can be drawn.

### Examples

#### region_northeast vs region_southeast
- Southeast costs are **$1,321 higher** than Northeast  
- **Significant** (p=0.012 < 0.05)  

#### region_northeast vs region_northwest
- Northwest costs are **$987 lower** than Northeast  
- **Not significant** (p=0.123 > 0.05)  
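
The sign of `meandiff` is easy to misread, so here is a small check on synthetic data. The groups `a`, `b`, `c` and their means are hypothetical. The sketch also assumes that the pairs in `tukey.meandiffs` follow the order of `itertools.combinations` over the sorted group labels.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)

# Hypothetical groups: "b" is built to cost about 2000 more than "a" and "c"
data = pd.DataFrame({
    "charges": np.concatenate([
        rng.normal(10000, 1500, 100),
        rng.normal(12000, 1500, 100),
        rng.normal(10000, 1500, 100),
    ]),
    "region": ["a"] * 100 + ["b"] * 100 + ["c"] * 100,
})

tukey = pairwise_tukeyhsd(endog=data["charges"], groups=data["region"], alpha=0.05)

# Assumption: pairs are ordered (a, b), (a, c), (b, c)
pairs = list(combinations(tukey.groupsunique, 2))
for (g1, g2), diff, rej in zip(pairs, tukey.meandiffs, tukey.reject):
    print(f"{g1} vs {g2}: meandiff = {diff:9.2f}, reject = {rej}")

# Compare the reported a-vs-b difference with the raw group means
means = data.groupby("region")["charges"].mean()
print(f"mean(b) - mean(a) = {means['b'] - means['a']:.2f}")
```

The first reported difference matches `mean(b) - mean(a)`, so a positive `meandiff` means the second group in the pair has the higher average.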

In [None]:
# 3.2 Tukey HSD Post-Hoc Test
import matplotlib.pyplot as plt
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Todo: Prepare data for Tukey (long format) using melt function

# Todo: Run Tukey HSD
print(tukey.summary())

# Visualize
tukey.plot_simultaneous()
plt.title('Tukey HSD: Regional Cost Comparisons')
plt.show()

# T-Test: Smoker vs Non-Smoker Costs

## Step 1: Data Filtering Process

### Group Creation
The operation creates two comparison groups:

1. **Smoker Group**  
   - Contains medical charges for individuals who smoke  
   - Identified by: `smoker_yes` column value = `1`

2. **Non-Smoker Group**  
   - Contains medical charges for individuals who don't smoke  
   - Identified by: `smoker_yes` column value = `0`

### Technical Implementation
- Uses boolean filtering to select specific rows
- Extracts only the `charges` column values
- Produces two separate data series containing numerical cost values

### Resulting Data Structure
| Group       | Selection Criteria | Data Extracted |
|-------------|--------------------|----------------|
| Smokers     | `smoker_yes == 1`  | Medical charges |
| Non-Smokers | `smoker_yes == 0`  | Medical charges |

### Purpose
- Enables direct cost comparison between smokers and non-smokers
- Isolates the smoking status variable for analysis
- Prepares clean data for statistical testing

### Sample code
```python
smoker = df[df['smoker_yes'] == 1]['charges']
non_smoker = df[df['smoker_yes'] == 0]['charges']
```

## Step 3: Perform T-Test

### What It Does
Conducts an independent samples t-test comparing medical costs between:
- Smokers
- Non-smokers

### Key Parameters
- **Unequal variance assumption** (Welch's t-test):  
  Accounts for cases where the two groups have different variance in their medical costs

### Output Values
1. **T-statistic**  
   - Measures the size of the difference between groups relative to the variation in the data  
   - Higher absolute values indicate stronger evidence against the null hypothesis

2. **P-value**  
   - Probability of observing a difference at least this large if the two groups truly had the same mean cost  
   - Used to determine statistical significance (typically p < 0.05)

### Interpretation Guide
| Value | Typical Meaning |
|-------|-----------------|
| Large t-statistic | Strong evidence of difference |
| Small p-value (< 0.05) | Statistically significant result |

### Sample code
```python
from scipy.stats import ttest_ind

t_stat, p_val = ttest_ind(smoker, non_smoker, equal_var=False)
```
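
To show what `equal_var=False` changes, this sketch recomputes Welch's t-statistic by hand and compares it with SciPy's result. The two samples are synthetic, with made-up means, spreads, and sizes, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)

# Hypothetical charges: the smoker group has a higher mean and a wider spread
smoker = rng.normal(32000, 11000, 250)
non_smoker = rng.normal(8500, 6000, 1000)

t_stat, p_val = ttest_ind(smoker, non_smoker, equal_var=False)

# Welch's statistic by hand: mean difference over the unpooled standard error
se = np.sqrt(smoker.var(ddof=1) / smoker.size
             + non_smoker.var(ddof=1) / non_smoker.size)
t_manual = (smoker.mean() - non_smoker.mean()) / se

print(f"scipy t = {t_stat:.3f}, manual t = {t_manual:.3f}, p = {p_val:.3g}")
```

Each group contributes its own variance to the standard error, so the test stays valid when one group is much more variable than the other.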
## Step 5: Effect Size Calculation (Cohen's d)

### What is Cohen's d?
Cohen's d is a standardized measure of effect size that quantifies the difference between two group means in terms of their combined variability. Unlike p-values which measure statistical significance, Cohen's d measures practical significance by showing how substantial the observed difference actually is.

### Calculation Components
1. **Mean Difference**  
   The raw difference between average medical costs of smokers and non-smokers

2. **Pooled Standard Deviation**  
   A weighted average of both groups' variability that serves as the "yardstick" for standardization

3. **Final Calculation**  
   The mean difference divided by the pooled standard deviation, resulting in a unitless effect size metric

### Interpretation Guidelines
| Cohen's d Value | Effect Size | Practical Meaning |
|-----------------|------------|-------------------|
| 0.2 | Small | Visible but minor difference |
| 0.5 | Medium | Substantial noticeable difference |
| ≥ 0.8 | Large | Practically important difference |

### Key Advantages
- Allows comparison across different studies
- Does not grow with sample size (unlike p-values, which shrink as samples get larger)
- Provides intuitive understanding of real-world impact

### Sample code
```python
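import numpy as np
n1, n2 = len(smoker), len(non_smoker)
mean_diff = smoker.mean() - non_smoker.mean()
# Pooled SD: each group's variance weighted by its degrees of freedom
pooled_std = np.sqrt(((n1 - 1) * smoker.var() + (n2 - 1) * non_smoker.var()) / (n1 + n2 - 2))
cohens_d = mean_diff / pooled_std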
```
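
For a version that does not depend on pandas or NumPy, the sketch below computes Cohen's d with the standard library and attaches the conventional label from the table above. It weights each group's variance by its degrees of freedom, which is one common way to compute the pooled standard deviation. The helper names, the toy values, and the "negligible" band below 0.2 are assumptions added for illustration.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Cohen's d using the degrees-of-freedom-weighted pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

def effect_label(d):
    """Map |d| onto the conventional small / medium / large bands."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"  # assumed label for values below the "small" threshold

# Hypothetical toy values, not drawn from the insurance data
group_a = [12.0, 15.0, 14.0, 10.0, 13.0]
group_b = [9.0, 11.0, 10.0, 8.0, 12.0]
d = cohens_d(group_a, group_b)
print(f"d = {d:.2f} ({effect_label(d)})")  # d = 1.59 (large)
```

When the two groups differ a lot in size, as smokers and non-smokers often do, the weighted pooled variance can differ noticeably from a simple average of the two variances.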

In [None]:
# 3.3 T-Test: Smoker vs Non-Smoker Costs
from scipy.stats import ttest_ind

# Todo: Create two groups

# Todo: conduct T-test
print(f"T-test Results: t-statistic = {t_stat:.2f}, p-value = {p_val:.4f}")

# Todo: find out Effect size (Cohen's d)
print(f"Effect Size (Cohen's d): {cohens_d:.2f}")