In [1]:
import pandas as pd
import numpy as np

import sys
import os

project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.append(project_root)

from src.hypothesis_tests import HypothesisTester

from src.eda_preprocessing import EDADataPreprocessor

In [2]:
DATA_PATH = '../data/MachineLearningRating_v3_cleaned.txt'

DELIMITER = '|'

loader = EDADataPreprocessor(DATA_PATH, DELIMITER)
df = loader.load_data()


Loaded: 1000024 rows, 50 columns


### Compute KPIs

In [3]:
# Margin = TotalPremium - TotalClaims
df["Margin"] = df["TotalPremium"] - df["TotalClaims"]

# Claim Frequency = 1 if TotalClaims > 0, else 0
df["ClaimFrequency"] = (df["TotalClaims"] > 0).astype(int)

# Claim Severity = TotalClaims for policies with claims only
df["ClaimSeverity"] = df["TotalClaims"]
df.loc[df["ClaimSeverity"] == 0, "ClaimSeverity"] = np.nan

# Check KPIs
df[["TotalPremium", "TotalClaims", "Margin", "ClaimFrequency", "ClaimSeverity"]].head()


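| | TotalPremium | TotalClaims | Margin | ClaimFrequency | ClaimSeverity |
|---|---|---|---|---|---|
| 0 | 21.929825 | 0.0 | 21.929825 | 0 | NaN |
| 1 | 21.929825 | 0.0 | 21.929825 | 0 | NaN |
| 2 | 0.0 | 0.0 | 0.0 | 0 | NaN |
| 3 | 512.84807 | 0.0 | 512.84807 | 0 | NaN |
| 4 | 0.0 | 0.0 | 0.0 | 0 | NaN |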
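To make the three KPI definitions concrete, here is a minimal sketch on a toy DataFrame. The values are hypothetical and not taken from the real dataset. It masks severity with `> 0` so that it lines up with the `ClaimFrequency` definition; this matches the `== 0` masking used above only if `TotalClaims` is never negative, which is an assumption.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the real dataset)
toy = pd.DataFrame({
    "TotalPremium": [100.0, 200.0, 150.0, 50.0],
    "TotalClaims": [0.0, 500.0, 0.0, 100.0],
})

toy["Margin"] = toy["TotalPremium"] - toy["TotalClaims"]
toy["ClaimFrequency"] = (toy["TotalClaims"] > 0).astype(int)
# Keep the claim amount only where a claim occurred; NaN elsewhere
toy["ClaimSeverity"] = toy["TotalClaims"].where(toy["TotalClaims"] > 0)

claim_rate = toy["ClaimFrequency"].mean()     # share of policies with a claim
mean_severity = toy["ClaimSeverity"].mean()   # NaN rows are skipped by default
mean_claims_all = toy["TotalClaims"].mean()   # includes zero-claim policies

print(claim_rate, mean_severity, mean_claims_all)
```

Because `mean()` skips NaN, the severity average is conditional on a claim having occurred, whereas the plain `TotalClaims` average is pulled down by the zero-claim policies.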


### Initialize Hypothesis Tester

In [4]:
tester = HypothesisTester(df)

### Hypothesis 1: Risk Differences Across Provinces

#### Understand the Hypothesis

- Goal: Assess whether the risk profile differs across provinces, using two metrics: claim frequency (likelihood of a claim) and claim severity (average size of a claim).

- Risk is measured by:

    1. Claim Frequency: whether a policy had a claim (1 if `TotalClaims > 0`, otherwise 0).

    2. Claim Severity: the claim amount, for policies that had a claim.

- Null Hypothesis (H₀): Claim frequency and claim severity are the same across all provinces.

- Alternative Hypothesis (H₁): At least one province differs in claim frequency or claim severity.
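The internals of `HypothesisTester.chi_squared_test` are not shown in this notebook. As an assumption about what such a method typically does, the sketch below builds a contingency table with `pd.crosstab` and passes it to `scipy.stats.chi2_contingency`. The toy data and group labels are hypothetical.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data: two groups with different claim rates
toy = pd.DataFrame({
    "Province": ["A"] * 100 + ["B"] * 100,
    "ClaimFrequency": [1] * 10 + [0] * 90 + [1] * 30 + [0] * 70,
})

# Contingency table: rows = provinces, columns = no claim / claim
table = pd.crosstab(toy["Province"], toy["ClaimFrequency"])

# Pearson chi-squared test of independence
# (SciPy applies Yates' continuity correction by default for 2x2 tables)
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2={chi2:.4f}, p={p_value:.6f}, dof={dof}")
```

The degrees of freedom equal (rows - 1) × (columns - 1), so the number of categories in the grouping column directly affects how a given statistic translates into a p-value.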

In [5]:
# Claim Frequency: categorical -> chi-squared
province_freq_result = tester.chi_squared_test("Province", "ClaimFrequency")
print("Claim Frequency across Provinces: Categorical", province_freq_result)

# Claim Severity: numerical -> ANOVA
province_severity_result = tester.anova_test("Province", "ClaimSeverity")
print("Claim Severity across Provinces Numerical:", province_severity_result)




Claim Frequency across Provinces: Categorical TestResult(statistic=104.1107, p_value=0.000000, reject_null=True)
Claim Severity across Provinces Numerical: TestResult(statistic=4.8692, p_value=0.000006, reject_null=True)


#### Interpretation:

Both tests reject H₀: claim frequency differs across provinces (χ² = 104.11, p < 0.001), and so does mean claim severity (ANOVA F = 4.87, p ≈ 0.000006).

These are omnibus tests. They show that at least one province differs from the others, but they do not identify which provinces differ or how large the differences are.

Identifying the higher-risk provinces requires comparing per-province claim rates and average claim amounts, or running post-hoc pairwise tests.
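Since the tests above only say that some difference exists, a per-group summary is one simple way to see where it might come from. The sketch below uses hypothetical toy data and assumes the same KPI definitions as earlier in the notebook; on the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Province": ["A", "A", "A", "B", "B", "B"],
    "TotalClaims": [0.0, 0.0, 300.0, 100.0, 0.0, 500.0],
})
toy["ClaimFrequency"] = (toy["TotalClaims"] > 0).astype(int)
toy["ClaimSeverity"] = toy["TotalClaims"].where(toy["TotalClaims"] > 0)

# One row per province: policy count, claim rate, mean claim amount
summary = (
    toy.groupby("Province")
    .agg(
        policies=("ClaimFrequency", "size"),
        claim_rate=("ClaimFrequency", "mean"),
        mean_severity=("ClaimSeverity", "mean"),
    )
    .sort_values("claim_rate", ascending=False)
)
print(summary)
```

Including the policy count alongside each rate helps flag groups whose estimates rest on very few observations.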

### Hypothesis 2: Risk Differences Across Top ZIP Codes

Goal: Check whether risk differs between postal codes. The tests are restricted to the 10 postal codes with the most records.

- Risk metrics: Claim Frequency and Claim Severity (same as before).


Null Hypotheses (H₀):

1. Claim frequency does not differ between ZIP codes.

2. Claim severity does not differ between ZIP codes.

Alternative Hypothesis (H₁):

- At least one ZIP code differs in claim frequency or claim severity.

Margin (profit) differences between ZIP codes are tested separately under Hypothesis 3.

In [6]:
# Select top 10 postal codes by count
top_zip_codes = df["PostalCode"].value_counts().nlargest(10).index

# Filter dataframe to only those ZIP codes
df_top_zip = df[df["PostalCode"].isin(top_zip_codes)]

# Update the tester's dataframe
tester.df = df_top_zip

# Claim Frequency: Chi-square
zip_freq_result = tester.chi_squared_test("PostalCode", "ClaimFrequency")
print("Claim Frequency across Top ZIP Codes:", zip_freq_result)

# Claim Severity: ANOVA
zip_severity_result = tester.anova_test("PostalCode", "ClaimSeverity")
print("Claim Severity across Top ZIP Codes:", zip_severity_result)



Claim Frequency across Top ZIP Codes: TestResult(statistic=72.6494, p_value=0.000000, reject_null=True)
Claim Severity across Top ZIP Codes: TestResult(statistic=5.2358, p_value=0.000001, reject_null=True)


#### Interpretation:

Claim frequency differs significantly across the top 10 ZIP codes (χ² = 72.65, p < 0.001), so the likelihood of a claim is not the same in all of these areas.

Claim severity also differs significantly (ANOVA F = 5.24, p ≈ 0.000001), so average claim amounts are higher in some of these ZIP codes than in others. As with provinces, the tests do not say which ZIP codes differ, and the conclusion applies only to the 10 most frequent postal codes.
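With a dataset of this size, even small differences can reach statistical significance, so an effect size can help judge practical relevance. The sketch below computes Cramér's V, a standard effect-size measure for contingency tables. The `cramers_v` helper and the toy counts are hypothetical and not part of `HypothesisTester`.

```python
import numpy as np
import pandas as pd
from scipy import stats

def cramers_v(table: pd.DataFrame) -> float:
    """Cramér's V for a contingency table (0 = no association, 1 = perfect)."""
    chi2, _, _, _ = stats.chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Hypothetical counts: claim rates of 10% vs 11% on 10,000 policies each
table = pd.DataFrame(
    [[9000, 1000], [8900, 1100]],
    index=["zip_a", "zip_b"],
    columns=["no_claim", "claim"],
)

_, p_value, _, _ = stats.chi2_contingency(table, correction=False)
v = cramers_v(table)
print(f"p={p_value:.4f}, V={v:.4f}")
```

In this toy case the difference is significant at the 5% level while V stays close to zero, which illustrates why a small p-value alone does not imply a large effect.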



### Hypothesis 3: Margin Differences Across ZIP Codes

Goal: Check if the average profit (Margin) differs across ZIP codes.

Metric: Margin = TotalPremium − TotalClaims

- Null Hypothesis (H₀): There are no significant margin differences between ZIP codes.

- Alternative Hypothesis (H₁): At least one ZIP code has a different average margin.

Test: Since Margin is numerical, we’ll use ANOVA to compare the means across groups.
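For intuition, the one-way ANOVA F statistic is the ratio of between-group variance to within-group variance. The sketch below computes it by hand with NumPy on hypothetical toy groups. The `one_way_f` helper is illustrative only; how `HypothesisTester.anova_test` is implemented is not shown in this notebook.

```python
import numpy as np

def one_way_f(groups):
    """Return the one-way ANOVA F statistic and its degrees of freedom."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)        # number of groups
    n = all_values.size    # total number of observations

    # Variation of group means around the grand mean
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical margins for three ZIP codes
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_f(groups)
print(f_stat, df_b, df_w)
```

An F value near 1 means the group means vary about as much as expected from within-group noise, which is the situation in which the null hypothesis is not rejected.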

In [7]:
# Select top 10 ZIP codes by count (repeats the selection from the Hypothesis 2 cell,
# so this section can be run on its own)
top_zip_codes = df["PostalCode"].value_counts().nlargest(10).index

# Filter dataframe to only those ZIP codes
df_top_zip = df[df["PostalCode"].isin(top_zip_codes)]

# Update the tester's dataframe
tester.df = df_top_zip


In [8]:
# Margin differences across top ZIP codes
zip_margin_result = tester.anova_test("PostalCode", "Margin")
print("Margin across Top ZIP Codes:", zip_margin_result)



Margin across Top ZIP Codes: TestResult(statistic=1.0506, p_value=0.396364, reject_null=False)



#### Interpretation:

The ANOVA does not reject H₀ (F = 1.05, p = 0.396): there is no statistically significant difference in average margin across the top 10 ZIP codes. Although claim frequency and claim severity differ between these areas, the data provide no evidence that this translates into different average margins.

Failing to reject H₀ does not prove that margins are equal, but these results on their own do not support a ZIP-code-specific adjustment based on margin.

### Hypothesis 4: Risk Differences Between Women and Men

Goal: Check whether the risk profile differs by gender.

Metrics:

- Claim Frequency: whether a claim occurred (categorical → chi-squared test)

- Claim Severity: the claim amount for policies with a claim (numerical → two-sample t-test)

- Null Hypothesis (H₀): There is no significant difference in risk between women and men.

- Alternative Hypothesis (H₁): Women and men have significantly different risk.
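A two-sample t-test on claim severity compares the mean claim amount of two groups after dropping policies without a claim. The sketch below uses Welch's variant (`equal_var=False`) from `scipy.stats.ttest_ind`. Whether `HypothesisTester.ttest_two_groups` uses Welch's or Student's t-test is not shown in this notebook, so that choice is an assumption, and the amounts are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical claim amounts; NaN marks a policy without a claim
male_severity = np.array([1200.0, 800.0, 1500.0, 950.0, 1100.0])
female_severity = np.array([1000.0, 1300.0, 700.0, 1250.0, np.nan])

# Drop policies without a claim before testing
male_clean = male_severity[~np.isnan(male_severity)]
female_clean = female_severity[~np.isnan(female_severity)]

# Welch's t-test does not assume equal variances in the two groups
result = stats.ttest_ind(male_clean, female_clean, equal_var=False)
print(f"t={result.statistic:.4f}, p={result.pvalue:.4f}")
```

Welch's variant is often a reasonable default for claim amounts, because group variances and group sizes rarely match.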

In [9]:
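# NOTE: tester.df still points at df_top_zip (set in the ZIP code cells above),
# so these tests run on the top-10 postal-code subset. Reset with
# `tester.df = df` first to test the full dataset.

# Claim Frequency: chi-squared
gender_freq_result = tester.chi_squared_test("Gender", "ClaimFrequency")
print("Claim Frequency by Gender:", gender_freq_result)

# Claim Severity: two-sample t-test (Male vs Female)
gender_severity_result = tester.ttest_two_groups(
    group_col="Gender",
    value_col="ClaimSeverity",
    group_a="Male",
    group_b="Female"
)
print("Claim Severity by Gender:", gender_severity_result)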


Claim Frequency by Gender: TestResult(statistic=1.2353, p_value=0.539217, reject_null=False)
Claim Severity by Gender: TestResult(statistic=-1.6846, p_value=0.210953, reject_null=False)


#### Interpretation: 

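Claim Frequency by Gender: The chi-squared test does not reject H₀ (χ² = 1.24, p = 0.539), so there is no statistically significant difference in claim frequency between gender categories. The reported p-value corresponds to 2 degrees of freedom, which suggests that the `Gender` column holds three categories rather than only "Male" and "Female". The test is therefore not a strict two-group comparison.

Claim Severity by Gender: The t-test between "Male" and "Female" also does not reject H₀ (t = -1.68, p = 0.211), so there is no statistically significant difference in average claim amount between men and women.

Overall, these tests provide no evidence that gender influences claim frequency or claim severity. Note that `tester.df` was last set to the top-10 ZIP code subset and is not reset before this cell. Assuming `HypothesisTester` reads from `tester.df`, the results describe that subset rather than the full dataset, and the tests should be re-run on the full data before gender is dropped from risk-based segmentation.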
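As a quick sanity check on the gender chi-squared result, the reported p-value can be reproduced from the reported statistic. For a chi-squared distribution with 2 degrees of freedom, the survival function has the closed form exp(-x/2). The sketch below assumes a standard Pearson chi-squared test on a `Gender` × `ClaimFrequency` table.

```python
import math

chi2_stat = 1.2353  # statistic reported for Gender vs ClaimFrequency

# Survival function of a chi-squared distribution with 2 degrees of freedom
p_if_dof_2 = math.exp(-chi2_stat / 2)
print(round(p_if_dof_2, 4))  # 0.5392
```

This agrees with the reported p-value of 0.539217 up to rounding of the statistic. With two claim outcomes, 2 degrees of freedom implies three gender categories in the tested data.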