# Task 3: Statistical Hypothesis Testing of Risk Drivers

This notebook statistically tests key hypotheses about risk drivers, rejecting or failing to reject each one, to form the basis for a new segmentation strategy. We use claim frequency, claim severity, and margin as our KPIs.


## Methodology
- **Claim Frequency:** Proportion of policies with at least one claim.
- **Claim Severity:** Average amount of a claim, given a claim occurred.
- **Margin:** TotalPremium - TotalClaims.
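
To make the three definitions concrete, here is a minimal sketch on a toy DataFrame. The four policies and their values are invented for illustration; only the column names `TotalPremium` and `TotalClaims` come from the notebook's data.

```python
import pandas as pd

# Toy data: four hypothetical policies (values are illustrative only)
toy = pd.DataFrame({
    "TotalPremium": [100.0, 120.0, 80.0, 150.0],
    "TotalClaims": [0.0, 300.0, 0.0, 100.0],
})

# Same KPI columns as the notebook derives on the real data
toy["ClaimOccurred"] = (toy["TotalClaims"] > 0).astype(int)
toy["Margin"] = toy["TotalPremium"] - toy["TotalClaims"]

# Frequency: share of policies with at least one claim
claim_frequency = toy["ClaimOccurred"].mean()
# Severity: mean claim amount among policies that claimed
claim_severity = toy.loc[toy["ClaimOccurred"] == 1, "TotalClaims"].mean()
# Margin: averaged per policy here; it can also be summed per segment
mean_margin = toy["Margin"].mean()

print(claim_frequency, claim_severity, mean_margin)
```

Note that severity is conditional on a claim, so policies without claims are filtered out before averaging rather than counted as zeros.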

For each hypothesis, we split the data into two groups and apply a chi-squared test of independence to claim frequency and Welch's t-test (unequal variances, `equal_var=False`) to means. We reject a null hypothesis when its p-value falls below the 0.05 significance level.
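
The two-group procedure described above can be sketched as a single helper. The function name `compare_two_groups`, the `Segment` column, and the toy data are hypothetical and not part of the notebook; the sketch assumes `ClaimOccurred` and `TotalClaims` columns exist.

```python
import pandas as pd
from scipy.stats import chi2_contingency, ttest_ind

def compare_two_groups(data, group_col, value_a, value_b, alpha=0.05):
    """Chi-squared test on claim frequency and Welch's t-test on severity."""
    both = data[data[group_col].isin([value_a, value_b])]

    # Frequency: 2x2 table of group vs claim / no claim
    table = pd.crosstab(both[group_col], both["ClaimOccurred"])
    _, p_freq, _, _ = chi2_contingency(table)

    # Severity: claim amounts only where a claim occurred
    claimed = both[both["ClaimOccurred"] == 1]
    sev_a = claimed.loc[claimed[group_col] == value_a, "TotalClaims"]
    sev_b = claimed.loc[claimed[group_col] == value_b, "TotalClaims"]
    _, p_sev = ttest_ind(sev_a, sev_b, equal_var=False, nan_policy="omit")

    return {
        "p_frequency": float(p_freq),
        "p_severity": float(p_sev),
        "reject_frequency": bool(p_freq < alpha),
        "reject_severity": bool(p_sev < alpha),
    }

def make_group(name, n_policies, claim_amounts):
    # Pad with zero-claim policies up to n_policies
    claims = list(claim_amounts) + [0.0] * (n_policies - len(claim_amounts))
    return pd.DataFrame({"Segment": name, "TotalClaims": claims})

# Toy data: B claims four times as often as A, with the same mean claim size
toy = pd.concat([
    make_group("A", 100, [100.0 + i for i in range(10)]),
    make_group("B", 100, [100.0 + (i % 10) for i in range(40)]),
], ignore_index=True)
toy["ClaimOccurred"] = (toy["TotalClaims"] > 0).astype(int)

result = compare_two_groups(toy, "Segment", "A", "B")
print(result)
```

On this toy data the frequency test rejects the null while the severity test does not, which shows why the two KPIs are tested separately.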


In [2]:
import sys
import os
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind, chi2_contingency
import warnings
sys.path.append(os.path.abspath(os.path.join('..')))
warnings.filterwarnings('ignore')

# Load the processed data (update path if needed)
df = pd.read_csv('../data/processed_task1.csv')

# KPI calculations
df['ClaimOccurred'] = (df['TotalClaims'] > 0).astype(int)
df['Margin'] = df['TotalPremium'] - df['TotalClaims']


## Hypothesis 1: No Risk Differences Across Provinces

- **H₀:** There are no risk differences across provinces (in frequency or severity).
- **Test:** Compare claim frequency (chi-squared) and claim severity (t-test) between the two provinces with the most policies.


In [3]:
# Select two provinces with the largest sample sizes
top_provinces = df['Province'].value_counts().nlargest(2).index.tolist()
province_a, province_b = top_provinces[0], top_provinces[1]
group_a = df[df['Province'] == province_a]
group_b = df[df['Province'] == province_b]

# Claim Frequency: Chi-squared test
contingency = pd.DataFrame({
    province_a: [group_a['ClaimOccurred'].sum(), group_a['ClaimOccurred'].count() - group_a['ClaimOccurred'].sum()],
    province_b: [group_b['ClaimOccurred'].sum(), group_b['ClaimOccurred'].count() - group_b['ClaimOccurred'].sum()]
    }, index=['ClaimOccurred', 'NoClaim'])
chi2, p_chi2, _, _ = chi2_contingency(contingency)

# Claim Severity: t-test (only where claim occurred)
severity_a = group_a[group_a['ClaimOccurred'] == 1]['TotalClaims']
severity_b = group_b[group_b['ClaimOccurred'] == 1]['TotalClaims']
t_stat, p_ttest = ttest_ind(severity_a, severity_b, equal_var=False, nan_policy='omit')

print(f'Provinces compared: {province_a} vs {province_b}')
print(f'Claim Frequency Chi2 p-value: {p_chi2:.4f}')
print(f'Claim Severity t-test p-value: {p_ttest:.4f}')


Provinces compared: Gauteng vs Western Cape
Claim Frequency Chi2 p-value: 0.0000
Claim Severity t-test p-value: 0.0306
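
The comparison above covers only the two largest provinces, so it says nothing about the others. One way to test all provinces at once is a chi-squared test on the full province-by-claim table. This is a minimal sketch on invented toy data; the helper name `chi2_across_groups` is hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_across_groups(data, group_col, flag_col="ClaimOccurred"):
    """Chi-squared test of independence between group_col and a 0/1 flag."""
    # Rows: one per group; columns: 0 (no claim) and 1 (claim)
    table = pd.crosstab(data[group_col], data[flag_col])
    chi2, p_value, dof, _ = chi2_contingency(table)
    return chi2, p_value, dof

# Toy data: three hypothetical provinces with 10%, 12% and 30% claim rates
toy = pd.DataFrame({
    "Province": ["A"] * 100 + ["B"] * 100 + ["C"] * 100,
    "ClaimOccurred": [1] * 10 + [0] * 90 + [1] * 12 + [0] * 88 + [1] * 30 + [0] * 70,
})

chi2, p_value, dof = chi2_across_groups(toy, "Province")
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}")
```

On the real data this would be called as `chi2_across_groups(df, 'Province')`. A significant result says that at least one province differs, not which one, so pairwise follow-up tests would still be needed.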


## Hypothesis 2: No Risk Differences Between Zip Codes

- **H₀:** There are no risk differences between zip codes (in frequency or severity).
- **Test:** Compare claim frequency (chi-squared) and claim severity (t-test) between the two zip codes with the most policies.


In [4]:
# Select two zip codes with the largest sample sizes
top_zips = df['PostalCode'].value_counts().nlargest(2).index.tolist()
zip_a, zip_b = top_zips[0], top_zips[1]
group_a = df[df['PostalCode'] == zip_a]
group_b = df[df['PostalCode'] == zip_b]

# Claim Frequency: Chi-squared test
contingency_zip = pd.DataFrame({
    zip_a: [group_a['ClaimOccurred'].sum(), group_a['ClaimOccurred'].count() - group_a['ClaimOccurred'].sum()],
    zip_b: [group_b['ClaimOccurred'].sum(), group_b['ClaimOccurred'].count() - group_b['ClaimOccurred'].sum()]
    }, index=['ClaimOccurred', 'NoClaim'])
chi2_zip, p_chi2_zip, _, _ = chi2_contingency(contingency_zip)

# Claim Severity: t-test
severity_a_zip = group_a[group_a['ClaimOccurred'] == 1]['TotalClaims']
severity_b_zip = group_b[group_b['ClaimOccurred'] == 1]['TotalClaims']
t_stat_zip, p_ttest_zip = ttest_ind(severity_a_zip, severity_b_zip, equal_var=False, nan_policy='omit')

print(f'Zip codes compared: {zip_a} vs {zip_b}')
print(f'Claim Frequency Chi2 p-value: {p_chi2_zip:.4f}')
print(f'Claim Severity t-test p-value: {p_ttest_zip:.4f}')


Zip codes compared: 2000 vs 122
Claim Frequency Chi2 p-value: 0.0579
Claim Severity t-test p-value: 0.7002


## Hypothesis 3: No Significant Margin Difference Between Zip Codes

- **H₀:** There are no significant margin (profit) differences between zip codes.
- **Test:** Compare margin (t-test) between the same two zip codes as in Hypothesis 2.


In [5]:
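# Reuses zip_a and zip_b from the Hypothesis 2 cell
margin_a = df.loc[df['PostalCode'] == zip_a, 'Margin']
margin_b = df.loc[df['PostalCode'] == zip_b, 'Margin']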
t_stat_margin, p_margin = ttest_ind(margin_a, margin_b, equal_var=False, nan_policy='omit')
print(f'Margin t-test p-value (zip {zip_a} vs {zip_b}): {p_margin:.4f}')


Margin t-test p-value (zip 2000 vs 122): 0.2445


## Hypothesis 4: No Significant Risk Difference Between Women and Men

- **H₀:** There are no significant risk differences between women and men.
- **Test:** Compare claim frequency (chi-squared) and claim severity (t-test) between genders.


In [6]:
# Claim Frequency: Chi-squared test
# Rows with any other Gender value are excluded from this comparison
female = df[df['Gender'] == 'Female']
male = df[df['Gender'] == 'Male']
contingency_gender = pd.DataFrame({
    'Female': [female['ClaimOccurred'].sum(), female['ClaimOccurred'].count() - female['ClaimOccurred'].sum()],
    'Male': [male['ClaimOccurred'].sum(), male['ClaimOccurred'].count() - male['ClaimOccurred'].sum()]
    }, index=['ClaimOccurred', 'NoClaim'])
chi2_gender, p_chi2_gender, _, _ = chi2_contingency(contingency_gender)

# Claim Severity: t-test
severity_female = df[(df['Gender'] == 'Female') & (df['ClaimOccurred'] == 1)]['TotalClaims']
severity_male = df[(df['Gender'] == 'Male') & (df['ClaimOccurred'] == 1)]['TotalClaims']
t_stat_gender, p_ttest_gender = ttest_ind(severity_female, severity_male, equal_var=False, nan_policy='omit')

print(f'Claim Frequency Chi2 p-value (Female vs Male): {p_chi2_gender:.4f}')
print(f'Claim Severity t-test p-value (Female vs Male): {p_ttest_gender:.4f}')


Claim Frequency Chi2 p-value (Female vs Male): 0.9515
Claim Severity t-test p-value (Female vs Male): 0.5680


## Interpretation & Business Recommendations
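For each test, we reject H₀ when the p-value is below 0.05; otherwise we fail to reject it, which means the data show no detectable difference, not that none exists. Only rejected hypotheses call for a business recommendation.

At this threshold, the results above reject H₀ only for provinces (Gauteng vs Western Cape: frequency p < 0.0001, severity p = 0.0306). We fail to reject H₀ for zip codes 2000 vs 122 on risk (frequency p = 0.0579, severity p = 0.7002) and on margin (p = 0.2445), and for gender (frequency p = 0.9515, severity p = 0.5680).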

- **Example:** If we reject the null hypothesis for provinces (p < 0.01), and Gauteng exhibits a 15% higher loss ratio than Western Cape, a regional risk adjustment to our premiums may be warranted.
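
The example refers to a loss ratio, which the notebook does not compute. A minimal sketch, assuming the usual definition (total claims divided by total premium per group), is shown here. The helper name `loss_ratio_by_group` and the toy figures are hypothetical, chosen so that group A ends up 15% above group B.

```python
import pandas as pd

def loss_ratio_by_group(data, group_col):
    """Loss ratio per group, assumed here to be sum(claims) / sum(premium)."""
    totals = data.groupby(group_col)[["TotalClaims", "TotalPremium"]].sum()
    return totals["TotalClaims"] / totals["TotalPremium"]

# Toy data: two hypothetical provinces with equal premium volume
toy = pd.DataFrame({
    "Province": ["A", "A", "B", "B"],
    "TotalPremium": [100.0, 100.0, 100.0, 100.0],
    "TotalClaims": [0.0, 115.0, 0.0, 100.0],
})

ratios = loss_ratio_by_group(toy, "Province")
# Relative gap of A over B, e.g. 0.15 means A's loss ratio is 15% higher
relative_gap = ratios["A"] / ratios["B"] - 1

print(ratios)
print(f"Relative gap (A vs B): {relative_gap:.2%}")
```

On the real data the call would be `loss_ratio_by_group(df, 'Province')`. The ratio is computed from group totals rather than by averaging per-policy ratios, which avoids dividing by zero or near-zero premiums on individual rows.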
