## Null Hypotheses to Test

1. **H₀₁**: There are no risk differences across provinces
2. **H₀₂**: There are no risk differences between postal codes
3. **H₀₃**: There is no significant margin (profit) difference between postal codes
4. **H₀₄**: There is no significant risk difference between women and men

## Metrics Definition

- **Claim Frequency**: Proportion of policies with at least one claim
- **Claim Severity**: Average claim amount among policies with at least one claim (TotalClaims > 0)
- **Margin**: TotalPremium - TotalClaims (positive means premiums exceeded claims)
- **Loss Ratio**: TotalClaims / TotalPremium (values above 1.0 mean claims exceeded premiums)
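
As a minimal sketch of how these four metrics could be computed for one group of policies: the helper `summarize_risk` and the toy data are illustrative assumptions, not the code in `scripts/hypothesis_testing.py`. It treats each row as one policy and computes the loss ratio on group totals rather than averaging per-row ratios.

```python
import pandas as pd

def summarize_risk(df: pd.DataFrame) -> dict:
    """Compute the four metrics for one group of policies (hypothetical helper)."""
    has_claim = df["TotalClaims"] > 0
    total_premium = df["TotalPremium"].sum()
    total_claims = df["TotalClaims"].sum()
    return {
        # Share of rows with at least one claim
        "claim_frequency": has_claim.mean(),
        # Mean claim amount over rows that have a claim (NaN if none do)
        "claim_severity": df.loc[has_claim, "TotalClaims"].mean(),
        # Mean of per-row TotalPremium - TotalClaims
        "avg_margin": (df["TotalPremium"] - df["TotalClaims"]).mean(),
        # Aggregate ratio; undefined when no premium was collected
        "loss_ratio": total_claims / total_premium if total_premium else float("nan"),
    }

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "TotalPremium": [100.0, 200.0, 150.0, 50.0],
    "TotalClaims": [0.0, 300.0, 0.0, 100.0],
})
metrics = summarize_risk(toy)
print(metrics)
```

On the toy data, two of four policies claim, so the claim frequency is 0.5 and the loss ratio is 400 / 500 = 0.8.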

In [None]:
# Import libraries
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings

from scripts.data_loader import DataLoader
from scripts.preprocessing import DataPreprocessor
from scripts.hypothesis_testing import HypothesisTester

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)

print("Libraries imported successfully!")

In [None]:
# Load and prepare data
DATA_PATH = '../data/MachineLearningRating_v3.txt'

loader = DataLoader(DATA_PATH)
df = loader.load_data()

# Preprocess
preprocessor = DataPreprocessor(df)
df_clean = preprocessor.convert_data_types()
df_clean = preprocessor.create_features()

print(f"Data loaded: {df_clean.shape}")
print(f"\nAvailable columns: {df_clean.columns.tolist()[:10]}...")

## 1. Hypothesis Test 1: Risk Differences Across Provinces

In [None]:
# Initialize hypothesis tester
tester = HypothesisTester(df_clean)

# Test provinces
province_results = tester.test_provinces()
print("\n=== PROVINCES TEST RESULTS ===")
print(f"Test Type: {province_results.get('test_type', 'N/A')}")
print(f"P-Value: {province_results.get('p_value', float('nan')):.6f}")  # NaN, not 1, if no p-value was returned
print(f"Decision: {'REJECT H₀' if province_results.get('reject_null') else 'FAIL TO REJECT H₀'}")

if province_results.get('reject_null'):
    print("\n✓ There ARE significant risk differences across provinces")
else:
    print("\n✗ No significant risk differences across provinces")

In [None]:
# Visualize provincial differences
if 'Province' in df_clean.columns and 'group_metrics' in province_results:
    metrics_df = pd.DataFrame(province_results['group_metrics']).T
    
    fig, axes = plt.subplots(2, 2, figsize=(16, 10))
    
    # Loss Ratio by Province
    axes[0, 0].bar(metrics_df.index, metrics_df['loss_ratio'])
    axes[0, 0].set_title('Loss Ratio by Province', fontsize=14, fontweight='bold')
    axes[0, 0].set_xlabel('Province')
    axes[0, 0].set_ylabel('Loss Ratio')
    axes[0, 0].tick_params(axis='x', rotation=45)
    axes[0, 0].axhline(y=1.0, color='r', linestyle='--', label='Break-even')
    axes[0, 0].legend()
    
    # Claim Frequency by Province
    axes[0, 1].bar(metrics_df.index, metrics_df['claim_frequency'], color='orange')
    axes[0, 1].set_title('Claim Frequency by Province', fontsize=14, fontweight='bold')
    axes[0, 1].set_xlabel('Province')
    axes[0, 1].set_ylabel('Claim Frequency')
    axes[0, 1].tick_params(axis='x', rotation=45)
    
    # Average Margin by Province
    axes[1, 0].bar(metrics_df.index, metrics_df['avg_margin'], color='green')
    axes[1, 0].set_title('Average Margin by Province', fontsize=14, fontweight='bold')
    axes[1, 0].set_xlabel('Province')
    axes[1, 0].set_ylabel('Average Margin (R)')
    axes[1, 0].tick_params(axis='x', rotation=45)
    axes[1, 0].axhline(y=0, color='r', linestyle='--')
    
    # Number of Policies by Province
    axes[1, 1].bar(metrics_df.index, metrics_df['n_policies'], color='purple')
    axes[1, 1].set_title('Policy Count by Province', fontsize=14, fontweight='bold')
    axes[1, 1].set_xlabel('Province')
    axes[1, 1].set_ylabel('Number of Policies')
    axes[1, 1].tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()

### Business Interpretation - Provinces

Based on the statistical test results:
- If H₀ is rejected: provinces have significantly different risk profiles
  - Action: Implement province-specific premium adjustments
  - Low-risk provinces: Consider premium reductions to attract more customers
  - High-risk provinces: Adjust premiums upward to maintain profitability
- If H₀ is not rejected: this test gives no statistical basis for province-level premium adjustments
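
The test itself is delegated to `tester.test_provinces()`, whose implementation is not shown in this notebook. One plausible way to test claim frequency across several groups is a chi-squared test of independence. The sketch below assumes that approach. The function name `claim_frequency_test`, the 0.05 significance level, and the toy provinces are illustrative assumptions.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def claim_frequency_test(df: pd.DataFrame, group_col: str, alpha: float = 0.05) -> dict:
    """Chi-squared test of independence between a grouping column and 'had a claim'."""
    has_claim = (df["TotalClaims"] > 0).rename("has_claim")
    # Contingency table: one row per group, columns = no claim / claim
    table = pd.crosstab(df[group_col], has_claim)
    chi2, p_value, dof, _expected = chi2_contingency(table)
    return {
        "chi2": float(chi2),
        "p_value": float(p_value),
        "dof": int(dof),
        "reject_null": bool(p_value < alpha),
    }

# Toy data, made up for illustration: province B claims four times as often as A and C
rows = []
for province, n_policies, n_claims in [("A", 100, 10), ("B", 100, 40), ("C", 100, 10)]:
    rows += [{"Province": province, "TotalClaims": 500.0}] * n_claims
    rows += [{"Province": province, "TotalClaims": 0.0}] * (n_policies - n_claims)
toy = pd.DataFrame(rows)

result = claim_frequency_test(toy, "Province")
print(result)
```

A chi-squared test only addresses claim frequency. Differences in severity or margin across provinces would need a separate test on a numeric outcome.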

## 2. Hypothesis Test 2 & 3: Postal Code Analysis (Risk & Margin)

In [None]:
# Test postal codes
postal_results = tester.test_postal_codes(sample_size=15)
print("\n=== POSTAL CODES TEST RESULTS ===")
print(f"Test Type: {postal_results.get('test_type', 'N/A')}")
print(f"P-Value: {postal_results.get('p_value', float('nan')):.6f}")  # NaN, not 1, if no p-value was returned
print(f"Decision: {'REJECT H₀' if postal_results.get('reject_null') else 'FAIL TO REJECT H₀'}")

if postal_results.get('reject_null'):
    print("\n✓ There ARE significant risk/margin differences across postal codes")
else:
    print("\n✗ No significant risk/margin differences across postal codes")

In [None]:
# Visualize postal code differences
if 'group_metrics' in postal_results:
    postal_metrics = pd.DataFrame(postal_results['group_metrics']).T
    postal_metrics = postal_metrics.sort_values('loss_ratio', ascending=False)
    
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    
    # Top 15 Postal Codes by Loss Ratio
    axes[0].barh(range(len(postal_metrics)), postal_metrics['loss_ratio'])
    axes[0].set_yticks(range(len(postal_metrics)))
    axes[0].set_yticklabels(postal_metrics.index)
    axes[0].set_xlabel('Loss Ratio')
    axes[0].set_title('Loss Ratio by Postal Code (Top 15)', fontsize=14, fontweight='bold')
    axes[0].axvline(x=1.0, color='r', linestyle='--', label='Break-even')
    axes[0].legend()
    axes[0].invert_yaxis()
    
    # Average Margin by Postal Code
    postal_metrics_margin = postal_metrics.sort_values('avg_margin', ascending=False)
    colors = ['green' if x > 0 else 'red' for x in postal_metrics_margin['avg_margin']]
    axes[1].barh(range(len(postal_metrics_margin)), postal_metrics_margin['avg_margin'], color=colors)
    axes[1].set_yticks(range(len(postal_metrics_margin)))
    axes[1].set_yticklabels(postal_metrics_margin.index)
    axes[1].set_xlabel('Average Margin (R)')
    axes[1].set_title('Average Margin by Postal Code (Top 15)', fontsize=14, fontweight='bold')
    axes[1].axvline(x=0, color='black', linestyle='-', linewidth=0.5)
    axes[1].invert_yaxis()
    
    plt.tight_layout()
    plt.show()
    
    # Print top and bottom performers
    print("\n=== TOP 5 POSTAL CODES (Lowest Risk) ===")
    top_5 = postal_metrics.nsmallest(5, 'loss_ratio')[['loss_ratio', 'claim_frequency', 'avg_margin', 'n_policies']]
    print(top_5)
    
    print("\n=== BOTTOM 5 POSTAL CODES (Highest Risk) ===")
    bottom_5 = postal_metrics.nlargest(5, 'loss_ratio')[['loss_ratio', 'claim_frequency', 'avg_margin', 'n_policies']]
    print(bottom_5)

### Business Interpretation - Postal Codes

Key Insights:
- **Low-risk postal codes** (loss ratio < 0.7): Opportunity for competitive pricing
- **High-risk postal codes** (loss ratio > 1.2): Require premium increases or risk mitigation
- **Margin analysis** reveals profitable vs unprofitable geographic segments
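
The thresholds above can be applied mechanically to a table of per-postal-code loss ratios. This is a rough sketch: the function `label_risk_tier`, the "medium" label for the range in between, and the example postal codes are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def label_risk_tier(loss_ratio: pd.Series, low: float = 0.7, high: float = 1.2) -> pd.Series:
    """Label each loss ratio as 'low' (< low), 'high' (> high), or 'medium' otherwise."""
    tiers = np.select(
        [loss_ratio < low, loss_ratio > high],
        ["low", "high"],
        default="medium",  # also catches NaN, since both comparisons are False
    )
    return pd.Series(tiers, index=loss_ratio.index, name="risk_tier")

# Hypothetical loss ratios indexed by postal code
loss_ratios = pd.Series({"2000": 0.45, "7100": 0.95, "4001": 1.60}, name="loss_ratio")
tiers = label_risk_tier(loss_ratios)
print(pd.concat([loss_ratios, tiers], axis=1))
```

Postal codes with few policies produce noisy loss ratios, so it may be worth filtering on `n_policies` before assigning tiers.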

## 3. Hypothesis Test 4: Gender Risk Differences
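
The cell below reports two tests from `tester.test_gender()`: a chi-squared test on claim frequency and a t-test on margin. That method's implementation is not shown here. As a self-contained sketch of what such a pair of tests could look like with SciPy, under stated assumptions: the function `compare_two_groups`, the `Gender` column name, the use of Welch's unequal-variance t-test, and the toy data are all illustrative.

```python
import pandas as pd
from scipy import stats

def compare_two_groups(df, group_col, group_a, group_b, alpha=0.05):
    """Compare claim frequency (chi-squared) and margin (Welch's t-test) for two groups."""
    a = df[df[group_col] == group_a]
    b = df[df[group_col] == group_b]
    both = pd.concat([a, b])

    # Chi-squared test of independence: group membership vs. "had a claim"
    table = pd.crosstab(both[group_col], both["TotalClaims"] > 0)
    _chi2, p_freq, _dof, _expected = stats.chi2_contingency(table)

    # Welch's t-test on per-policy margin (does not assume equal variances)
    margin_a = a["TotalPremium"] - a["TotalClaims"]
    margin_b = b["TotalPremium"] - b["TotalClaims"]
    _t, p_margin = stats.ttest_ind(margin_a, margin_b, equal_var=False)

    return {
        "chi_squared_test": {"p_value": float(p_freq), "reject_null": bool(p_freq < alpha)},
        "t_test_margin": {"p_value": float(p_margin), "reject_null": bool(p_margin < alpha)},
    }

# Toy data, made up for illustration: 10% vs 30% claim rate, same premium
rows = []
for gender, n_policies, n_claims in [("Male", 100, 10), ("Female", 100, 30)]:
    rows += [{"Gender": gender, "TotalPremium": 100.0, "TotalClaims": 500.0}] * n_claims
    rows += [{"Gender": gender, "TotalPremium": 100.0, "TotalClaims": 0.0}] * (n_policies - n_claims)
toy = pd.DataFrame(rows)

toy_results = compare_two_groups(toy, "Gender", "Male", "Female")
print(toy_results)
```

Margin data in insurance is typically heavy-tailed, so a rank-based alternative such as `scipy.stats.mannwhitneyu` may be worth checking alongside the t-test.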

In [None]:
# Test gender differences
gender_results = tester.test_gender()

if 'error' not in gender_results:
    print("\n=== GENDER TEST RESULTS ===")
    
    chi_test = gender_results['chi_squared_test']
    print("\nChi-Squared Test (Claim Frequency):")
    print(f"  P-Value: {chi_test['p_value']:.6f}")
    print(f"  Decision: {'REJECT H₀' if chi_test['reject_null'] else 'FAIL TO REJECT H₀'}")
    
    t_test = gender_results['t_test_margin']
    print("\nT-Test (Margin Difference):")
    print(f"  P-Value: {t_test['p_value']:.6f}")
    print(f"  Decision: {'REJECT H₀' if t_test['reject_null'] else 'FAIL TO REJECT H₀'}")
    
    if chi_test['reject_null'] or t_test['reject_null']:
        print("\n✓ There ARE significant differences between genders (claim frequency and/or margin)")
    else:
        print("\n✗ No significant differences between genders in claim frequency or margin")
else:
    print(f"Error in gender test: {gender_results['error']}")

In [None]:
# Visualize gender differences
if 'error' not in gender_results:
    # Assumes test_gender() reports Male as group A and Female as group B
    male_metrics = gender_results['chi_squared_test']['group_a_metrics']
    female_metrics = gender_results['chi_squared_test']['group_b_metrics']
    
    comparison_data = pd.DataFrame({
        'Male': [male_metrics['claim_frequency'], male_metrics['loss_ratio'], 
                male_metrics['avg_margin'], male_metrics['claim_severity']],
        'Female': [female_metrics['claim_frequency'], female_metrics['loss_ratio'], 
                  female_metrics['avg_margin'], female_metrics['claim_severity']]
    }, index=['Claim Frequency', 'Loss Ratio', 'Avg Margin', 'Claim Severity'])
    
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    metrics = ['Claim Frequency', 'Loss Ratio', 'Avg Margin', 'Claim Severity']
    colors_list = ['skyblue', 'lightcoral']
    
    for idx, metric in enumerate(metrics):
        ax = axes[idx // 2, idx % 2]
        comparison_data.loc[metric].plot(kind='bar', ax=ax, color=colors_list)
        ax.set_title(f'{metric} by Gender', fontsize=12, fontweight='bold')
        ax.set_xlabel('Gender')
        ax.set_ylabel(metric)
        ax.tick_params(axis='x', rotation=0)
        if metric == 'Loss Ratio':
            ax.axhline(y=1.0, color='r', linestyle='--', linewidth=1)
        if metric == 'Avg Margin':
            ax.axhline(y=0, color='r', linestyle='--', linewidth=1)
    
    plt.tight_layout()
    plt.show()
    
    print("\n=== GENDER COMPARISON ===")
    print(comparison_data)

### Business Interpretation - Gender

**Important Note**: While statistical differences may exist, pricing based on gender may be:
- Prohibited by law in many jurisdictions
- Considered discriminatory
- Subject to regulatory restrictions

**Alternative Approach**: Use correlated factors that are legally permissible:
- Vehicle type preferences
- Driving behavior data
- Geographic location
- Coverage choices

## 4. Summary of Hypothesis Testing Results

In [None]:
# Generate comprehensive report
all_results = {
    'provinces': province_results,
    'postal_codes': postal_results,
    'gender': gender_results
}

report = tester.generate_report(all_results)
print(report)

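# Save report (explicit UTF-8 so non-ASCII characters such as "H₀" write correctly on every platform)
with open('../reports/hypothesis_testing_results.txt', 'w', encoding='utf-8') as f:
    f.write(report)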
    
print("\nReport saved to: ../reports/hypothesis_testing_results.txt")

## 5. Business Recommendations

### Based on Hypothesis Testing Results:

#### 1. Geographic Risk-Based Pricing
- **Action**: Implement differentiated pricing by province and postal code
- **Low-risk areas**: Reduce premiums by 5-15% to gain market share
- **High-risk areas**: Increase premiums by 10-25% or implement risk mitigation

#### 2. Targeted Marketing Strategy
- **Focus on low-risk segments** for customer acquisition
- **Develop retention programs** for high-value, low-risk customers
- **Geographic expansion** prioritizing low-risk provinces/postal codes

#### 3. Risk Mitigation Measures
- **High-risk areas**: 
  - Require additional safety features (tracking devices, immobilizers)
  - Offer discounts for risk-reducing behaviors
  - Implement stricter underwriting criteria

#### 4. Product Development
- **Create geographic-specific products** tailored to regional risk profiles
- **Usage-based insurance** for high-risk areas
- **Bundle products** with risk-reduction services

#### 5. Continuous Monitoring
- **Quarterly review** of geographic risk patterns
- **A/B testing** of pricing strategies
- **Claims trend analysis** for early risk detection

## Next Steps

1. **Task 4**: Build predictive models for claim severity and premium optimization
2. **Feature Engineering**: Incorporate geographic risk factors into models
3. **Model Deployment**: Create API for real-time premium calculation
4. **Monitoring System**: Track model performance and risk metrics over time