# Task 3: Statistical Hypothesis Testing for Risk Drivers (Using Utils)

This notebook statistically validates or rejects key hypotheses about risk drivers in the insurance dataset, utilizing modular utility functions for clarity and maintainability.

In [16]:
# Import Required Libraries
import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Resolve the project layout relative to the notebook's working directory
project_root = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
src_path = os.path.join(project_root, 'src')
utils_path = os.path.join(src_path, 'utils')

# Put the project root and src/ on sys.path (once) so `utils` is importable
for path in (project_root, src_path):
    if path not in sys.path:
        sys.path.insert(0, path)

# Ensure the package marker files exist before the import below runs
open(os.path.join(src_path, '__init__.py'), 'a').close()
open(os.path.join(utils_path, '__init__.py'), 'a').close()

from utils.task3_utils import chi2_test, t_test, claim_frequency, claim_severity, margin, plot_group_metric


In [17]:
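# Load Data
# First attempt: assume a tab-delimited file. The output below shows a single
# column, so this guess is wrong and the separator is checked in a later cell.
df = pd.read_csv('../data/MachineLearningRating_v3.txt', delimiter='\t')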
# Display basic info
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000098 entries, 0 to 1000097
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------

Unnamed: 0,UnderwrittenCoverID|PolicyID|TransactionMonth|IsVATRegistered|Citizenship|LegalType|Title|Language|Bank|AccountType|MaritalStatus|Gender|Country|Province|PostalCode|MainCrestaZone|SubCrestaZone|ItemType|mmcode|VehicleType|RegistrationYear|make|Model|Cylinders|cubiccapacity|kilowatts|bodytype|NumberOfDoors|VehicleIntroDate|CustomValueEstimate|AlarmImmobiliser|TrackingDevice|CapitalOutstanding|NewVehicle|WrittenOff|Rebuilt|Converted|CrossBorder|NumberOfVehiclesInFleet|SumInsured|TermFrequency|CalculatedPremiumPerTerm|ExcessSelected|CoverCategory|CoverType|CoverGroup|Section|Product|StatutoryClass|StatutoryRiskType|TotalPremium|TotalClaims
0,145249|12827|2015-03-01 00:00:00|True| |Close...
1,145249|12827|2015-05-01 00:00:00|True| |Close...
2,145249|12827|2015-07-01 00:00:00|True| |Close...
3,145255|12827|2015-05-01 00:00:00|True| |Close...
4,145255|12827|2015-07-01 00:00:00|True| |Close...


## Hypotheses and Metrics
We test the following null hypotheses (H₀):

- There are no risk differences across provinces
- There are no risk differences between zip codes
- There are no significant margin (profit) differences between zip codes
- There are no significant risk differences between Women and Men

**Metrics:**
- Claim Frequency
- Claim Severity
- Margin (TotalPremium - TotalClaims)
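
The three metrics can be sketched on a toy frame as follows. The real helpers `claim_frequency`, `claim_severity`, and `margin` live in `utils.task3_utils` and their signatures are not shown in this notebook, so the `*_sketch` functions and the toy values below are assumptions for illustration only.

```python
import pandas as pd

def claim_frequency_sketch(df: pd.DataFrame) -> float:
    """Share of rows with at least one claim (TotalClaims > 0)."""
    return float((df["TotalClaims"] > 0).mean())

def claim_severity_sketch(df: pd.DataFrame) -> float:
    """Average claim amount, given that a claim occurred."""
    claims = df.loc[df["TotalClaims"] > 0, "TotalClaims"]
    return float(claims.mean())

def margin_sketch(df: pd.DataFrame) -> float:
    """Total margin: premiums collected minus claims paid."""
    return float((df["TotalPremium"] - df["TotalClaims"]).sum())

# Toy data (hypothetical values, not taken from the insurance dataset)
toy = pd.DataFrame({
    "TotalPremium": [100.0, 100.0, 100.0, 100.0],
    "TotalClaims": [0.0, 0.0, 50.0, 250.0],
})

freq = claim_frequency_sketch(toy)  # 2 of 4 rows have a claim
sev = claim_severity_sketch(toy)    # mean of the two non-zero claims
mar = margin_sketch(toy)            # total premium minus total claims
print(freq, sev, mar)
```

Severity is conditioned on `TotalClaims > 0`, which matches how the severity samples are built in the testing cell further down.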

In [18]:
file_path = '../data/MachineLearningRating_v3.txt'

for sep in [',', '|', '\t']:
    print(f"\nTrying sep='{sep}':")
    try:
        df_sample = pd.read_csv(file_path, sep=sep, nrows=5)
        print(df_sample.head())
        print("Columns:", df_sample.columns.tolist())
    except Exception as e:
        print("Error:", e)


Trying sep=',':
  UnderwrittenCoverID|PolicyID|TransactionMonth|IsVATRegistered|Citizenship|LegalType|Title|Language|Bank|AccountType|MaritalStatus|Gender|Country|Province|PostalCode|MainCrestaZone|SubCrestaZone|ItemType|mmcode|VehicleType|RegistrationYear|make|Model|Cylinders|cubiccapacity|kilowatts|bodytype|NumberOfDoors|VehicleIntroDate|CustomValueEstimate|AlarmImmobiliser|TrackingDevice|CapitalOutstanding|NewVehicle|WrittenOff|Rebuilt|Converted|CrossBorder|NumberOfVehiclesInFleet|SumInsured|TermFrequency|CalculatedPremiumPerTerm|ExcessSelected|CoverCategory|CoverType|CoverGroup|Section|Product|StatutoryClass|StatutoryRiskType|TotalPremium|TotalClaims
0  145249|12827|2015-03-01 00:00:00|True|  |Close...
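
As an alternative to trying separators by eye, the standard library's `csv.Sniffer` can guess the delimiter from a text sample. This is a minimal sketch, assuming a short in-memory sample shaped like the header shown above; `sniff_delimiter` is a hypothetical helper, not part of the project's utilities.

```python
import csv

def sniff_delimiter(sample: str, candidates: str = ",|\t;") -> str:
    """Guess the delimiter of a text sample, restricted to `candidates`."""
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter

# Hypothetical sample mimicking the first columns of the data file
sample = (
    "UnderwrittenCoverID|PolicyID|TransactionMonth\n"
    "145249|12827|2015-03-01 00:00:00\n"
    "145255|12827|2015-05-01 00:00:00\n"
)
detected = sniff_delimiter(sample)
print(repr(detected))
```

On the real file, the sample could come from reading the first few kilobytes, and the detected character could then be passed to `pd.read_csv` as `sep`.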

## Data Segmentation
For each hypothesis, we define Group A and Group B and ensure comparability.
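
One basic comparability check is that both groups actually contain enough rows before a test is run. The sketch below assumes a hypothetical helper `split_groups` and an arbitrary `min_size` threshold; neither comes from the project's utilities.

```python
import pandas as pd

def split_groups(df, column, value_a, value_b, min_size=30):
    """Return the rows for two values of `column`, refusing tiny or empty groups."""
    group_a = df[df[column] == value_a]
    group_b = df[df[column] == value_b]
    for label, group in ((value_a, group_a), (value_b, group_b)):
        if len(group) < min_size:
            raise ValueError(f"{column}={label!r} has only {len(group)} rows")
    return group_a, group_b

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Province": ["Gauteng"] * 3 + ["Western Cape"] * 2,
    "TotalClaims": [0, 10, 0, 5, 0],
})
a, b = split_groups(toy, "Province", "Gauteng", "Western Cape", min_size=2)
print(len(a), len(b))
```

Failing loudly on an empty group also catches type mismatches, such as filtering an integer column with a string value.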

In [19]:
df = pd.read_csv('../data/MachineLearningRating_v3.txt', sep='|')
print(df.columns.tolist())
print(df.head())


['UnderwrittenCoverID', 'PolicyID', 'TransactionMonth', 'IsVATRegistered', 'Citizenship', 'LegalType', 'Title', 'Language', 'Bank', 'AccountType', 'MaritalStatus', 'Gender', 'Country', 'Province', 'PostalCode', 'MainCrestaZone', 'SubCrestaZone', 'ItemType', 'mmcode', 'VehicleType', 'RegistrationYear', 'make', 'Model', 'Cylinders', 'cubiccapacity', 'kilowatts', 'bodytype', 'NumberOfDoors', 'VehicleIntroDate', 'CustomValueEstimate', 'AlarmImmobiliser', 'TrackingDevice', 'CapitalOutstanding', 'NewVehicle', 'WrittenOff', 'Rebuilt', 'Converted', 'CrossBorder', 'NumberOfVehiclesInFleet', 'SumInsured', 'TermFrequency', 'CalculatedPremiumPerTerm', 'ExcessSelected', 'CoverCategory', 'CoverType', 'CoverGroup', 'Section', 'Product', 'StatutoryClass', 'StatutoryRiskType', 'TotalPremium', 'TotalClaims']
   UnderwrittenCoverID  PolicyID     TransactionMonth  IsVATRegistered  \
0               145249     12827  2015-03-01 00:00:00             True   
1               145249     12827  2015-05-01 00:00

## Statistical Testing
Appropriate tests are performed for each hypothesis using utility functions.

- Chi-squared for categorical (claim frequency)
- T-test for numerical (claim severity, margin)
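
The implementations of `chi2_test` and `t_test` are not shown in this notebook. The sketch below is one plausible way to write them with SciPy, matching the return shapes used in the next cell; the `*_sketch` names, the choice of Welch's t-test, and the toy data are assumptions.

```python
import pandas as pd
from scipy import stats

def chi2_test_sketch(df, group_col, outcome_col):
    """Chi-squared test of independence on the group x outcome contingency table."""
    table = pd.crosstab(df[group_col], df[outcome_col])
    chi2, p, dof, expected = stats.chi2_contingency(table)
    return chi2, p, dof, expected

def t_test_sketch(a, b):
    """Welch's t-test (no equal-variance assumption); NaNs are dropped first."""
    a = pd.Series(a).dropna()
    b = pd.Series(b).dropna()
    t_stat, p = stats.ttest_ind(a, b, equal_var=False)
    return t_stat, p

# Toy data (hypothetical): group A claims on 10% of rows, group B on 40%
toy = pd.DataFrame({
    "Province": ["A"] * 100 + ["B"] * 100,
    "HasClaim": [True] * 10 + [False] * 90 + [True] * 40 + [False] * 60,
})
chi2, p_chi2, dof, expected = chi2_test_sketch(toy, "Province", "HasClaim")
print(f"chi2 p-value: {p_chi2:.4g}, dof: {dof}")

# Toy severity samples (hypothetical) with clearly different means
t_stat, p_t = t_test_sketch([100, 110, 120, 130, 140], [200, 210, 220, 230, 240])
print(f"t-test p-value: {p_t:.4g}")
```

Welch's variant is used here because claim amounts in different segments rarely share a common variance; the project's own `t_test` may make a different choice.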

In [20]:
# Create a binary column for claim occurrence
df['HasClaim'] = df['TotalClaims'] > 0

# Segment by Province
group_a = df[df['Province'] == 'Gauteng']
group_b = df[df['Province'] == 'Western Cape']

# Segment by Gender
group_men = df[df['Gender'] == 'Male']
group_women = df[df['Gender'] == 'Female']


# Chi-squared test for claim frequency between provinces
chi2, p, dof, expected = chi2_test(df, 'Province', 'HasClaim')
print(f'Chi-squared test p-value (Province): {p:.4f}')

# T-test for claim severity between two provinces
sev_a = group_a[group_a['TotalClaims'] > 0]['TotalClaims']
sev_b = group_b[group_b['TotalClaims'] > 0]['TotalClaims']
t_stat, p_val = t_test(sev_a, sev_b)
print(f'T-test p-value (Claim Severity, Gauteng vs Western Cape): {p_val:.4f}')

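# T-test for margin between two postal codes
# PostalCode holds integers (see the unique values printed in the output),
# so compare against ints; string literals would produce empty groups
print(df['PostalCode'].unique())

zip_a = df[df['PostalCode'] == 2000]
zip_b = df[df['PostalCode'] == 8001]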
margin_a = zip_a['TotalPremium'] - zip_a['TotalClaims']
margin_b = zip_b['TotalPremium'] - zip_b['TotalClaims']
t_stat, p_val = t_test(margin_a, margin_b)
print(f'T-test p-value (Margin, PostalCode 2000 vs 8001): {p_val:.4f}')

# T-test for risk difference between Women and Men
sev_men = group_men[group_men['TotalClaims'] > 0]['TotalClaims']
sev_women = group_women[group_women['TotalClaims'] > 0]['TotalClaims']
t_stat, p_val = t_test(sev_men, sev_women)
print(f'T-test p-value (Claim Severity, Men vs Women): {p_val:.4f}')

Chi-squared test p-value (Province): 0.0000
T-test p-value (Claim Severity, Gauteng vs Western Cape): 0.0306
[1459 1513 1619 1625 1629 1852 1982 2007 2066 4093 2000 1577 1610 2410
 6200  122 1520 1709 1739 4000 4066 4091 4342 4359 7784  970 6213 6390
 1868 4310  299  309  152  181 1821 4449 4037  139 4074 1057 7100 9725
 1863 1875 2001 2091 3170 3950 1021 2380  300  302  458 7750  157 4811
 4930 5000 5090 5160 5219 5410 5920 6025 6139 5040 6201 6212 6231 9744
    1    8   64   84  162  164 8000  182  183  186  190 5326  192  194
  199  200  201  208  258  264 1431 1441 1455 1494 1496  284 9762 1507
 1540 1559 1571 1724 1754 1757 1759 1779 1803 1804 1806 1809 1818 1828
 1830 1862 1864 1865 1984 2014 2019 2021 2040 2090 2188 2198 3180 3200
 3245 3310 3380 3609 3610 3612 3613 3370 3600 3629 3630 3650 3780 3900
 3934 3973 5143 3880 3882 3915 4001 4004 4011 4023 4027 4051 4052 4053
 4056 4057 4059 4060 4061 4063 4071 4089 4092 4105 4110 4111 4126 4137
 4140 4180 4200 4240 4260 4309 4340 436



## Interpretation & Business Recommendation
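For each rejected hypothesis (p < 0.05), provide a clear business interpretation. In the output above, both the province chi-squared test (p printed as 0.0000, i.e. below 0.00005) and the Gauteng vs Western Cape severity t-test (p = 0.0306) reject H₀ at the 5% level. Example (the 15% figure is illustrative; no loss ratio is computed in this notebook):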

- We reject the null hypothesis for provinces (p < 0.01). Gauteng exhibits a 15% higher loss ratio than Western Cape, suggesting a regional risk adjustment to our premiums may be warranted.
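
A statement like the example above needs a loss ratio per group to back it up. The following is a minimal sketch, assuming loss ratio is defined as total claims divided by total premium; `loss_ratio_by_group` is a hypothetical helper and the toy figures are invented for illustration, not results from the dataset.

```python
import pandas as pd

def loss_ratio_by_group(df, group_col):
    """Loss ratio = sum(TotalClaims) / sum(TotalPremium) within each group."""
    totals = df.groupby(group_col)[["TotalClaims", "TotalPremium"]].sum()
    return totals["TotalClaims"] / totals["TotalPremium"]

# Toy data (hypothetical values, not taken from the insurance dataset)
toy = pd.DataFrame({
    "Province": ["Gauteng", "Gauteng", "Western Cape", "Western Cape"],
    "TotalPremium": [100.0, 100.0, 100.0, 100.0],
    "TotalClaims": [0.0, 120.0, 0.0, 100.0],
})
ratios = loss_ratio_by_group(toy, "Province")

# Relative difference between the two provinces' loss ratios
relative_gap = ratios["Gauteng"] / ratios["Western Cape"] - 1
print(ratios)
print(f"Gauteng vs Western Cape: {relative_gap:+.0%}")
```

Running the same helper on the real `df` would give the figure to quote in the recommendation in place of the placeholder percentage.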