## Introduction

This notebook conducts statistical hypothesis testing to validate or reject key hypotheses about risk drivers for AlphaCare Insurance Solutions (ACIS). The analysis focuses on identifying low-risk segments to optimize marketing strategies and premium pricing for car insurance in South Africa. The key metrics analyzed are **Claim Frequency** (proportion of policies with at least one claim), **Claim Severity** (average claim amount when a claim occurs), and **Margin** (Total Premium - Total Claims). The results will inform a new segmentation strategy to attract low-risk clients with competitive premiums.
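The three metrics defined above can be illustrated on a toy table. This is only a sketch: the column names `TotalPremium` and `TotalClaims` are assumptions about the dataset's schema, and the numbers are invented.

```python
import pandas as pd

# Toy policy-level data. `TotalPremium` and `TotalClaims` are assumed column
# names; the values are made up for illustration.
policies = pd.DataFrame({
    "TotalPremium": [100.0, 200.0, 150.0, 50.0],
    "TotalClaims": [0.0, 500.0, 0.0, 100.0],
})

# A policy "has a claim" when its claim total is positive.
has_claim = policies["TotalClaims"] > 0

# Claim Frequency: proportion of policies with at least one claim.
claim_frequency = has_claim.mean()

# Claim Severity: average claim amount among policies that claimed.
claim_severity = policies.loc[has_claim, "TotalClaims"].mean()

# Margin: total premium minus total claims.
margin = policies["TotalPremium"].sum() - policies["TotalClaims"].sum()

print(claim_frequency, claim_severity, margin)
```

Note that severity is conditioned on a claim having occurred, so policies without claims affect frequency and margin but not severity.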

### Setup and Imports

This section configures the system path so that the project's `scripts` package can be imported from the notebook, then imports the functions needed for hypothesis testing. The custom `hypothesis.py` module contains functions to prepare data, calculate metrics, and perform statistical tests.

In [1]:
import sys, os
sys.path.insert(0, os.path.abspath('..'))
import pandas as pd

from scripts.hypothesis import (
    prepare_hypothesis_data,
    check_group_equivalence,
    run_all_hypothesis_tests
)

### Data Loading and Preparation

The dataset, `MachineLearningRating_v3_cleaned.csv`, contains historical insurance claim data. The `prepare_hypothesis_data` function filters the data to ensure valid entries (e.g., positive premiums, non-negative claims) and adds derived columns: `has_claim` (binary indicator for claims) and `margin` (premium minus claims). Only records with valid gender values ('Male', 'Female') are retained.

In [2]:
df = pd.read_csv("../data/processed/MachineLearningRating_v3_cleaned.csv", low_memory=False)
df = prepare_hypothesis_data(df)

Filtered Data: (22487, 53)


This indicates the dataset has been reduced to 22,487 rows and 53 columns after filtering, ensuring data quality for hypothesis testing.
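The filtering and derived columns described above could look roughly like the sketch below. It is not the actual code in `hypothesis.py`; the function name `prepare_hypothesis_data_sketch` is hypothetical and the column names `TotalPremium`, `TotalClaims`, and `Gender` are assumptions.

```python
import pandas as pd

def prepare_hypothesis_data_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative stand-in for prepare_hypothesis_data (assumed behaviour)."""
    valid = (
        (df["TotalPremium"] > 0)                    # positive premiums only
        & (df["TotalClaims"] >= 0)                  # non-negative claims only
        & df["Gender"].isin(["Male", "Female"])     # valid gender values only
    )
    out = df.loc[valid].copy()
    out["has_claim"] = (out["TotalClaims"] > 0).astype(int)   # binary claim flag
    out["margin"] = out["TotalPremium"] - out["TotalClaims"]  # premium minus claims
    return out

# Made-up rows: the second has a zero premium, the fourth a negative claim.
raw = pd.DataFrame({
    "TotalPremium": [100.0, 0.0, 80.0, 120.0],
    "TotalClaims": [0.0, 50.0, 200.0, -10.0],
    "Gender": ["Male", "Female", "Female", "Male"],
})
prepared = prepare_hypothesis_data_sketch(raw)
print("Filtered Data:", prepared.shape)
```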

### Hypothesis Testing

This section executes statistical tests to evaluate the following null hypotheses:

1. **H₀**: There are no risk differences across provinces (Gauteng vs. KwaZulu-Natal).
2. **H₀**: There are no risk differences between high- and low-risk zones (MainCrestaZone).
3. **H₀**: There are no significant margin differences between high- and low-risk zones.
4. **H₀**: There are no significant risk differences between women and men.

The `run_all_hypothesis_tests` function performs:
- **Chi-Squared Tests** for claim frequency (categorical data).
- **T-Tests** for claim severity and margin (numerical data).
- **FDR adjustment** of the resulting p-values (False Discovery Rate method) to account for multiple comparisons.
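A minimal sketch of the two test types on synthetic data follows. The notebook does not show the internals of `run_all_hypothesis_tests`, so the use of SciPy, the column name `claim_amount`, the group labels, and the choice of Welch's unequal-variance t-test are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 2000  # policies per synthetic group

# Synthetic data: group A claims about 10% of the time, group B about 20%.
toy = pd.DataFrame({
    "group": np.repeat(["A", "B"], n),
    "has_claim": np.concatenate(
        [rng.random(n) < 0.10, rng.random(n) < 0.20]
    ).astype(int),
})
# Claim amounts are drawn from the same distribution in both groups.
toy["claim_amount"] = np.where(
    toy["has_claim"] == 1, rng.gamma(2.0, 5000.0, size=2 * n), 0.0
)

# Claim frequency: chi-squared test on the group x has_claim contingency table.
table = pd.crosstab(toy["group"], toy["has_claim"])
chi2, p_freq, dof, _ = stats.chi2_contingency(table)

# Claim severity: t-test on claim amounts, restricted to policies with a claim.
claims = toy[toy["has_claim"] == 1]
a = claims.loc[claims["group"] == "A", "claim_amount"]
b = claims.loc[claims["group"] == "B", "claim_amount"]
t_stat, p_sev = stats.ttest_ind(a, b, equal_var=False)

print(f"frequency p = {p_freq:.3g}, severity p = {p_sev:.3g}")
```

The frequency test works on counts of policies, while the severity test compares only the claimants in each group.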

In [3]:
results = run_all_hypothesis_tests(df)
display(results)

Running Hypothesis Tests...


| | test | metric | group_a | group_b | p_value | p_value_adjusted |
|---|------|--------|---------|---------|---------|------------------|
| 0 | Chi-Squared | claim_frequency | Gauteng | KwaZulu-Natal | 0.043458 | 0.145128 |
| 1 | T-Test | claim_severity | Gauteng | KwaZulu-Natal | 0.422846 | 0.60359 |
| 2 | Chi-Squared | claim_frequency | High | Low | 6.5e-05 | 0.000456 |
| 3 | T-Test | claim_severity | High | Low | 0.961883 | 0.961883 |
| 4 | T-Test | margin | High | Low | 0.062198 | 0.145128 |
| 5 | Chi-Squared | claim_frequency | Female | Male | 0.706141 | 0.823831 |
| 6 | T-Test | claim_severity | Female | Male | 0.431136 | 0.60359 |
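The notebook does not say which FDR procedure is used. As a check, the sketch below applies the Benjamini-Hochberg step-up adjustment (an assumption) to the seven raw p-values reported above. It reproduces the `p_value_adjusted` column to within display rounding, which is consistent with Benjamini-Hochberg being the method. The helper name `benjamini_hochberg` is introduced here for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices, ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

# Raw and adjusted p-values copied from the results table, in row order.
raw_p = [0.043458, 0.422846, 6.5e-05, 0.961883, 0.062198, 0.706141, 0.431136]
reported = [0.145128, 0.60359, 0.000456, 0.961883, 0.145128, 0.823831, 0.60359]

adjusted = benjamini_hochberg(raw_p)
for raw, adj, rep in zip(raw_p, adjusted, reported):
    print(f"raw={raw:<9} recomputed={adj:.6f} reported={rep}")
```

The adjustment explains why rows 0 and 4 share the same adjusted value: the second-smallest p-value is capped by the adjusted value of the third-smallest.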


### Interpretation of Results

- **Province (Gauteng vs. KwaZulu-Natal)**:
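  - Claim Frequency: p = 0.145 (adjusted) > 0.05 → Fail to reject H₀. No significant difference in claim frequency. Note that the unadjusted p-value (0.043) is below 0.05; the difference loses significance only after FDR adjustment.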
  - Claim Severity: p = 0.604 (adjusted) > 0.05 → Fail to reject H₀. No significant difference in claim severity.
- **Risk Zones (High vs. Low)**:
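  - Claim Frequency: p = 0.00046 (adjusted) < 0.05 → Reject H₀. High-risk zones have significantly higher claim frequency. (The Chi-Squared test itself establishes only that the rates differ; the direction comes from comparing the group claim rates, which are not shown in this output.)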
  - Claim Severity: p = 0.962 (adjusted) > 0.05 → Fail to reject H₀. No significant difference in claim severity.
  - Margin: p = 0.145 (adjusted) > 0.05 → Fail to reject H₀. No significant difference in margin (unadjusted p = 0.062).
- **Gender (Female vs. Male)**:
  - Claim Frequency: p = 0.824 (adjusted) > 0.05 → Fail to reject H₀. No significant difference in claim frequency.
  - Claim Severity: p = 0.604 (adjusted) > 0.05 → Fail to reject H₀. No significant difference in claim severity.

### Group Equivalence Check

To ensure valid A/B testing, the control (Gauteng) and test (KwaZulu-Natal) groups must be statistically equivalent across key attributes (e.g., Marital Status, Vehicle Type, Cover Type, Cubic Capacity). The `check_group_equivalence` function tests for differences using Chi-Squared tests (categorical variables) or T-Tests (numerical variables).

In [4]:
equivalence = check_group_equivalence(df, 'Province', 'Gauteng', 'KwaZulu-Natal',
                                      ['MaritalStatus', 'VehicleType', 'CoverType', 'cubiccapacity'])
display(equivalence)

| | column | test | p_value |
|---|--------|------|---------|
| 0 | MaritalStatus | Chi-Squared | 0.099377 |
| 1 | VehicleType | Chi-Squared | 0.90731 |
| 2 | cubiccapacity | T-Test | 0.0478 |
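The choice between a Chi-Squared test and a T-Test per column can be sketched as a dispatch on dtype. This is an illustrative stand-in, not the actual `check_group_equivalence` code: the function name `check_equivalence_sketch`, the use of SciPy, and Welch's t-test are assumptions, and the data is invented.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype
from scipy import stats

def check_equivalence_sketch(df, group_col, group_a, group_b, columns):
    """Compare two groups column by column (assumed behaviour, for illustration)."""
    subset = df[df[group_col].isin([group_a, group_b])]
    rows = []
    for col in columns:
        if is_numeric_dtype(subset[col]):
            # Numerical attribute: compare group means with Welch's t-test.
            a = subset.loc[subset[group_col] == group_a, col].dropna()
            b = subset.loc[subset[group_col] == group_b, col].dropna()
            _, p = stats.ttest_ind(a, b, equal_var=False)
            test = "T-Test"
        else:
            # Categorical attribute: chi-squared test on the contingency table.
            table = pd.crosstab(subset[group_col], subset[col])
            _, p, _, _ = stats.chi2_contingency(table)
            test = "Chi-Squared"
        rows.append({"column": col, "test": test, "p_value": p})
    return pd.DataFrame(rows)

# Made-up data with one categorical and one numerical attribute.
toy = pd.DataFrame({
    "Province": ["Gauteng"] * 6 + ["KwaZulu-Natal"] * 6,
    "MaritalStatus": ["Married", "Single"] * 6,
    "cubiccapacity": [1400, 1600, 1800, 2000, 1500, 1700,
                      1300, 1600, 1900, 2000, 1400, 1800],
})
result = check_equivalence_sketch(
    toy, "Province", "Gauteng", "KwaZulu-Natal", ["MaritalStatus", "cubiccapacity"]
)
print(result)
```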


### Interpretation of Equivalence

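- **Marital Status** (p = 0.099) and **Vehicle Type** (p = 0.907) show no significant differences (p > 0.05), so there is no evidence of imbalance on these attributes. A non-significant result does not by itself prove the groups are equivalent.
- **Cubic Capacity** (p = 0.048) falls just below the 0.05 threshold, indicating a detectable difference in engine size between the provinces that could introduce bias. The p-value alone does not measure how large that difference is, and the output reports only unadjusted p-values, so this result is a caution rather than proof of meaningful imbalance.
- **Cover Type** was requested but does not appear in the output, possibly due to insufficient data or low cell counts in the contingency table. Further investigation may be needed.

Overall, the groups appear reasonably balanced on the attributes tested, apart from cubic capacity. This supports the province-based hypothesis tests, subject to that caveat and the missing Cover Type result.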
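Because a p-value does not say how large an imbalance is, one common complement is a standardized mean difference such as Cohen's d. The sketch below shows the calculation on hypothetical engine sizes; the values are invented and are not the ACIS data, and the helper name `cohens_d` is introduced here for illustration.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Hypothetical cubic capacities (cc) for two provinces; not real data.
group_a = [1400, 1600, 1800, 2000]
group_b = [1500, 1700, 1900, 2100]

d = cohens_d(group_a, group_b)
print(f"Cohen's d = {d:.3f}")
```

By convention, an absolute value near 0.2 is often described as small and near 0.8 as large, though such thresholds are rules of thumb rather than fixed criteria.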

## Hypothesis Testing Summary

### Hypotheses and Findings

| Hypothesis | Result | Interpretation |
|------------|--------|----------------|
| H₀: No risk difference between **Gauteng** and **KwaZulu-Natal** | ❌ *Fail to reject* | Risk (claim frequency and severity) is not significantly different between these provinces. |
| H₀: No risk difference between **High vs. Low Risk Zones** | ✅ *Reject* (Claim Frequency) | High-risk zones have significantly higher claim frequency, but no difference in claim severity. |
| H₀: No margin difference between **High vs. Low Risk Zones** | ❌ *Fail to reject* | Margin differences are not statistically significant after FDR adjustment (adjusted p = 0.145, unadjusted p = 0.062); more data is needed to draw a conclusion. |
| H₀: No risk/severity difference by **Gender** | ❌ *Fail to reject* | No significant difference in claim frequency or claim severity was found between male and female clients. |

### Business Recommendations

Based on the statistical findings, the following recommendations are proposed to optimize ACIS's marketing and pricing strategies:

1. **Province-Based Pricing**: Do not implement regional pricing differences between Gauteng and KwaZulu-Natal, as the data shows no significant risk variation. This simplifies marketing efforts and avoids unnecessary premium adjustments.
2. **Zone-Based Risk Pricing**: Leverage the significant difference in claim frequency between high- and low-risk zones (MainCrestaZone). Offer lower premiums in low-risk zones to attract new clients while adjusting premiums upward in high-risk zones to reflect increased claim frequency.
3. **Gender-Neutral Pricing**: Maintain gender-neutral pricing, as no significant risk differences were found between male and female clients. This supports fair pricing practices and avoids potential regulatory scrutiny.
4. **Monitor Margins by Zone**: Margin differences between high- and low-risk zones were not statistically significant (adjusted p = 0.145), but the unadjusted p-value (0.062) is close to the 0.05 threshold. Collect additional data to confirm whether zone-based margin adjustments are warranted.
5. **Refine Data Collection**: Address minor group imbalances (e.g., cubic capacity differences) by collecting more comprehensive data or refining segmentation criteria to ensure robust A/B testing in future analyses.

### Next Steps

- **Expand Zone Analysis**: Further segment MainCrestaZone data to identify specific high-risk areas for targeted interventions.
- **Incorporate Additional Features**: Explore other features (e.g., vehicle age, driver experience) in future hypothesis tests to uncover additional risk drivers.
- **Validate with Larger Samples**: Increase sample size for margin analysis to improve statistical power and confirm trends.
- **Integrate with Predictive Models**: Use these findings to inform machine learning models for premium optimization (Task 4).

This analysis provides a solid foundation for ACIS to refine its segmentation strategy, focusing on zone-based risk differences while maintaining equitable pricing across provinces and genders.