# Task 3 – Hypothesis Testing on Risk Drivers
This notebook evaluates four business hypotheses. Three of them use **Loss Ratio (TotalClaims / TotalPremium)** as the risk KPI; H₀ 3 uses **Margin (TotalPremium − TotalClaims)** instead.

- **H₀ 1**: No risk (loss-ratio) difference across *provinces*
- **H₀ 2**: No risk difference between *zip codes*
- **H₀ 3**: No *margin* (profit) difference between zip codes
- **H₀ 4**: No risk difference between *Women* and *Men*

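Every test is run at significance level α = 0.05: a p-value below 0.05 leads us to **reject** the null hypothesis; otherwise we fail to reject it. All tests are rank-based (Kruskal-Wallis for H₀ 1–3, Mann-Whitney U for H₀ 4), so they do not assume normally distributed data.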
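
The decision rule above can be wrapped in a small helper so that every test reports its verdict in the same way. This is a minimal sketch: the name `decide` and the constant `ALPHA` are illustrative and do not appear elsewhere in the notebook.

```python
ALPHA = 0.05  # significance level stated above


def decide(p_value: float, alpha: float = ALPHA) -> str:
    """Return the test decision for a p-value at significance level alpha."""
    # Also rejects NaN, because any comparison with NaN is False
    if not 0.0 <= p_value <= 1.0:
        raise ValueError(f"p-value must lie in [0, 1], got {p_value}")
    # Strict inequality: p exactly equal to alpha does not reject
    return "Reject H0" if p_value < alpha else "Fail to reject H0"


print(decide(0.003))
print(decide(0.2))
```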

In [None]:
# Imports
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
sns.set(style="whitegrid")

In [None]:
# ---- 1. Load data ----
DATA_PATH = Path('data') / 'MachineLearningRating_v3.txt'
df = pd.read_csv(DATA_PATH, sep='|', low_memory=False)
print('Shape:', df.shape)

In [None]:
# ---- 2. KPI Engineering ----
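# Zero premiums become NaN so the ratio is undefined rather than infinite
df['LossRatio'] = df['TotalClaims'] / df['TotalPremium'].replace({0: np.nan})
df = df.dropna(subset=['LossRatio']).copy()  # explicit copy keeps later column assignments warning-free
df['LossRatio'] = df['LossRatio'].clip(upper=10)  # cap extreme ratios at 10 (upper tail only)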
df['Margin']    = df['TotalPremium'] - df['TotalClaims']
df['Gender']    = df['Gender'].str.title().fillna('Unknown')
print(df[['LossRatio','Margin']].describe().T)
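
To see what the KPI step does to edge cases, the same transformations can be applied to a handful of made-up policies. This is only a sketch: the `toy` frame and its values are hypothetical and are not drawn from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy policies, not taken from the real dataset
toy = pd.DataFrame({
    "TotalPremium": [100.0, 0.0, 50.0, 10.0],
    "TotalClaims": [20.0, 5.0, 0.0, 500.0],
})

# Zero premium -> NaN, so the division never produces inf
toy["LossRatio"] = toy["TotalClaims"] / toy["TotalPremium"].replace({0: np.nan})
toy = toy.dropna(subset=["LossRatio"]).copy()        # drops the zero-premium row
toy["LossRatio"] = toy["LossRatio"].clip(upper=10)   # 500 / 10 = 50 is capped at 10
toy["Margin"] = toy["TotalPremium"] - toy["TotalClaims"]

print(toy)
```

The zero-premium policy disappears from the analysis entirely, and the capped policy keeps its uncapped margin, since the cap applies to `LossRatio` only.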

## 3. Hypothesis 1 — Province vs Loss Ratio

In [None]:
groups = [g['LossRatio'].values for _, g in df.groupby('Province') if len(g) >= 30]  # min sample 30
stat, p = stats.kruskal(*groups)
print(f'Kruskal-Wallis H={stat:.2f}, p={p:.4g}')
print('Reject H0' if p < 0.05 else 'Fail to reject H0')
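
The group-then-test pattern in the cell above is repeated for the zip code hypotheses, so it could be factored into one function. The sketch below is one possible shape for it; `kruskal_by_group` is a hypothetical helper name, and the demo data is synthetic rather than the insurance dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats


def kruskal_by_group(data, group_col, value_col, min_n=30):
    """Kruskal-Wallis test of value_col across the levels of group_col.

    Groups with fewer than min_n rows are skipped, mirroring the cell above.
    Returns (H statistic, p-value, number of groups tested).
    """
    samples = [
        g[value_col].to_numpy()
        for _, g in data.groupby(group_col)
        if len(g) >= min_n
    ]
    if len(samples) < 2:
        raise ValueError("Need at least two groups with enough rows")
    h_stat, p_value = stats.kruskal(*samples)
    return h_stat, p_value, len(samples)


# Synthetic demo data (hypothetical, not the insurance dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Province": np.repeat(["A", "B", "C", "D"], [40, 40, 40, 5]),
    "LossRatio": np.concatenate([
        rng.exponential(0.3, 40),
        rng.exponential(0.3, 40),
        rng.exponential(0.9, 40),
        rng.exponential(0.3, 5),  # only 5 rows, so this group is filtered out
    ]),
})

h_stat, p_value, k = kruskal_by_group(demo, "Province", "LossRatio")
print(f"groups tested={k}, H={h_stat:.2f}, p={p_value:.4g}")
```

Returning the number of groups tested makes it visible how many levels the `min_n` filter removed before the test ran.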

## 4. Hypothesis 2 — Zip Code vs Loss Ratio

In [None]:
zip_groups = [g['LossRatio'].values for _, g in df.groupby('PostalCode') if len(g) >= 30]  # keep all codes with >=30 records
stat2, p2 = stats.kruskal(*zip_groups)
print(f'Kruskal-Wallis H={stat2:.2f}, p={p2:.4g}')
print('Reject H0' if p2 < 0.05 else 'Fail to reject H0')

## 5. Hypothesis 3 — Margin Difference between Zip Codes

In [None]:
margin_groups = [g['Margin'].values for _, g in df.groupby('PostalCode') if len(g) >= 30]
stat3, p3 = stats.kruskal(*margin_groups)
print(f'Kruskal-Wallis H={stat3:.2f}, p={p3:.4g}')
print('Reject H0' if p3 < 0.05 else 'Fail to reject H0')
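
A rejected H₀ only says that at least one zip code differs; it does not say which one or by how much. A per-group summary is one way to make that visible. The sketch below uses a made-up `toy` frame and a lowered threshold `MIN_N = 3` purely for illustration; the notebook itself filters at 30 records.

```python
import pandas as pd

# Hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    "PostalCode": [1000, 1000, 1000, 2000, 2000, 2000, 3000],
    "Margin": [10.0, 12.0, -5.0, 30.0, 25.0, 28.0, 99.0],
})
MIN_N = 3  # small threshold for the toy data; the notebook uses 30

# Size, median and mean margin per postal code
summary = toy.groupby("PostalCode")["Margin"].agg(
    n="count", median_margin="median", mean_margin="mean"
)
summary = summary[summary["n"] >= MIN_N]                          # same size filter as the test
summary = summary.sort_values("median_margin", ascending=False)   # best margin first

print(summary)
```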

## 6. Hypothesis 4 — Gender vs Loss Ratio

In [None]:
# Rows with any other Gender value (e.g. 'Unknown') are excluded from this test
male = df.loc[df['Gender'] == 'Male', 'LossRatio']
female = df.loc[df['Gender'] == 'Female', 'LossRatio']
stat4, p4 = stats.mannwhitneyu(male, female, alternative='two-sided')
print(f'Mann-Whitney U={stat4:.2e}, p={p4:.4g}')
print('Reject H0' if p4 < 0.05 else 'Fail to reject H0')
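
Because only rows labelled `Male` or `Female` enter this test, it helps to check the group sizes first. The sketch below repeats the gender clean-up and the test on synthetic data; the `toy` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data with inconsistent casing and missing labels (hypothetical)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "Gender": ["male", "FEMALE", None, "Female", "Male"] * 20,
    "LossRatio": rng.exponential(0.4, 100),
})
toy["Gender"] = toy["Gender"].str.title().fillna("Unknown")

# Inspect how many rows each label has before testing
counts = toy["Gender"].value_counts()
print(counts)

male = toy.loc[toy["Gender"] == "Male", "LossRatio"]
female = toy.loc[toy["Gender"] == "Female", "LossRatio"]
if min(len(male), len(female)) == 0:
    raise ValueError("One of the groups is empty; the test cannot run")

u_stat, p_value = stats.mannwhitneyu(male, female, alternative="two-sided")
print(f"n_male={len(male)}, n_female={len(female)}, U={u_stat:.1f}, p={p_value:.4g}")
```

If the `Unknown` group is large, the conclusion applies only to the subset of policies with a recorded gender.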

## 7. Summary & Next Steps
*Run all cells, review the printed decisions, then add business interpretation here.*
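
One way to review the four decisions side by side is to collect the p-values in a single table. In the sketch below the p-values are placeholders for illustration only, not results from this notebook; in practice they would be replaced by `p`, `p2`, `p3` and `p4` from the cells above.

```python
import pandas as pd

ALPHA = 0.05  # same significance level as the rest of the notebook

# Placeholder p-values for illustration only (NOT real results);
# in the notebook, substitute p, p2, p3 and p4 from the cells above.
p_values = {
    "H0 1: provinces (loss ratio)": 0.001,
    "H0 2: zip codes (loss ratio)": 0.020,
    "H0 3: zip codes (margin)": 0.300,
    "H0 4: women vs men (loss ratio)": 0.080,
}

summary = pd.DataFrame(
    {"Hypothesis": list(p_values), "p_value": list(p_values.values())}
)
summary["Decision"] = [
    "Reject H0" if p < ALPHA else "Fail to reject H0" for p in summary["p_value"]
]

print(summary.to_string(index=False))
```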