In [1]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import os, sys
# Add the 'scripts' directory to the Python path for module imports
sys.path.append(os.path.abspath(os.path.join('..', 'scripts')))

In [2]:

# Read the dataset
df = pd.read_csv('../data/cleanedDataset.csv', low_memory=False, index_col=False)

In [3]:

# Initialize the class
from hypothesis_testing import ABHypothesisTesting
ab_test = ABHypothesisTesting(df)
# A/B Test 1: Compare two provinces ('Province_A' and 'Province_B' are placeholders; use values present in df['Province'])
result_province = ab_test.perform_ab_test('Province', 'Province_A', 'Province_B', 'TotalPremium')
print('*** A/B Test: Risk Differences Between Province_A and Province_B ***')
print(result_province)
print()

# A/B Test 2: Compare two postal codes ('1234' and '5678' are placeholders; use values present in df['PostalCode'], matching its dtype)
result_postalcode = ab_test.perform_ab_test('PostalCode', '1234', '5678', 'TotalPremium')
print('*** A/B Test: Risk Differences Between PostalCode 1234 and PostalCode 5678 ***')
print(result_postalcode)
print()

# A/B Test 3: Compare gender (e.g., Male and Female)
result_gender = ab_test.perform_ab_test('Gender', 'Male', 'Female', 'TotalPremium')
print('*** A/B Test: Risk Differences Between Men and Women ***')
print(result_gender)
print()

*** A/B Test: Risk Differences Between Province_A and Province_B ***
One of the groups is empty. A/B test cannot be performed on Province_A and Province_B.

*** A/B Test: Risk Differences Between PostalCode 1234 and PostalCode 5678 ***
One of the groups is empty. A/B test cannot be performed on 1234 and 5678.

*** A/B Test: Risk Differences Between Men and Women ***
Z-test on TotalPremium: Z-statistic = -0.8013377046545027, p-value = 0.42293616848457005
Fail to reject the null hypothesis.
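The first two comparisons above report an empty group, which means at least one value in each pair ('Province_A'/'Province_B', '1234'/'5678') does not occur in the corresponding column. A possible way to avoid this is to pick group labels from the values that are actually present. The sketch below is only an illustration: `top_two_values` is a hypothetical helper, not part of `ABHypothesisTesting`, and the small DataFrame is made up. Note also that a string such as '1234' will not match a column stored as integers.

```python
import pandas as pd

def top_two_values(frame: pd.DataFrame, column: str) -> list:
    """Return the two most frequent non-null values in `column` (hypothetical helper)."""
    counts = frame[column].value_counts()  # sorted by frequency, NaN excluded
    if len(counts) < 2:
        raise ValueError(f"{column!r} has fewer than two distinct values")
    return counts.index[:2].tolist()

# Made-up data, purely for illustration
demo = pd.DataFrame({"Province": ["P1", "P1", "P1", "P2", "P2", "P3"]})
group_a, group_b = top_two_values(demo, "Province")
print(group_a, group_b)
```

The two returned labels could then be passed to `perform_ab_test` in place of the placeholders.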



#### A/B Test: Risk Differences Between Men and Women
Result: "Z-test on TotalPremium: Z-statistic = -0.8013377046545027, p-value = 0.42293616848457005. Fail to reject the null hypothesis."

Interpretation:

The Z-statistic of -0.801 means the observed difference in mean TotalPremium between men and women lies less than one standard error from zero.
The p-value of 0.423 is well above the conventional 0.05 significance level, so we fail to reject the null hypothesis.
Conclusion: The data show no statistically significant difference in mean TotalPremium (used here as a proxy for risk) between men and women. Failing to reject the null hypothesis does not prove the groups are identical; it only means this sample provides no evidence of a difference.
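To make the link between the Z-statistic and the p-value concrete, here is a minimal sketch of a two-sample z-test using only the standard library. It assumes a large-sample comparison of means with sample variances; the actual implementation inside `ABHypothesisTesting` is not shown in this notebook and may differ. The function name `two_sample_z_test` and the demo numbers are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def two_sample_z_test(a, b):
    """Two-sided z-test for a difference in means (large-sample approximation)."""
    # Standard error of the difference, built from the two sample variances
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    # Two-sided p-value from the standard normal distribution
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Made-up numbers, only to show the call
z_demo, p_demo = two_sample_z_test([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(z_demo, p_demo)

# Two-sided p-value implied by the Z-statistic reported above
p_reported = 2 * (1 - NormalDist().cdf(abs(-0.8013377046545027)))
print(round(p_reported, 3))
```

The last value should land close to the 0.423 reported in the output, which is a quick way to sanity-check a printed test result.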

In [4]:
# Initialize the class
from hypothesis_testing import ABHypothesisTesting
ab_test = ABHypothesisTesting(df)

# Run all tests and print results in a human-readable format
results = ab_test.run_all_tests()

for test_name, result in results.items():
    print(f'*** {test_name} ***')
    print(result)
    print()  # Print a newline for better readability

*** Risk Differences Across Provinces ***
Chi-squared test on Province and TotalPremium: chi2 = 2198720.0560572664, p-value = 0.0
Reject the null hypothesis.

*** Risk Differences Between Postal Codes ***
Chi-squared test on PostalCode and TotalPremium: chi2 = 182589197.14373723, p-value = 0.0
Reject the null hypothesis.

*** Margin Differences Between Postal Codes ***
Z-test on TotalPremium: Z-statistic = -4.256402256391749, p-value = 2.0774279581692312e-05
Reject the null hypothesis.

*** Risk Differences Between Women and Men ***
T-test on TotalPremium: T-statistic = -0.82823800820278, p-value = 0.4075400891888227
Fail to reject the null hypothesis.



### Detailed Interpretation

#### 1. Risk Differences Across Provinces
Chi-squared test on Province and TotalPremium:

Chi-squared statistic (χ²): 2198720.056\
p-value: 0.0 (effectively very close to zero)\
Conclusion: "Reject the null hypothesis."\
Interpretation: 

The chi-squared test of independence checks whether two categorical variables are associated. Province is categorical, but TotalPremium is numeric; if it was passed to the test unbinned, every distinct premium value is treated as its own category, which should be kept in mind when reading the statistic.
The reported p-value of 0.0 is a value too small to represent in floating point rather than an exact zero. It indicates that the distribution of TotalPremium values differs across provinces by more than chance alone would explain. Chi-squared statistics grow with sample size, so a very large χ² does not by itself indicate a large effect.
Conclusion: TotalPremium (used here as a proxy for risk) differs significantly across provinces. This suggests that province is a relevant factor, and AlphaCare Insurance Solutions (ACIS) may want to consider province of residence when setting premiums.
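Because a chi-squared test expects categorical inputs, one common approach is to bin a continuous variable before building the contingency table. The sketch below shows that idea on synthetic data; the quartile binning, the `PremiumBand` column name, and the generated values are all assumptions made for illustration, not the method used by `run_all_tests`.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic data, purely for illustration
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Province": rng.choice(["P1", "P2", "P3"], size=600),
    "TotalPremium": rng.gamma(shape=2.0, scale=100.0, size=600),
})

# Bin the continuous premium into quartiles so both variables are categorical
demo["PremiumBand"] = pd.qcut(demo["TotalPremium"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Contingency table of counts: provinces in rows, premium bands in columns
table = pd.crosstab(demo["Province"], demo["PremiumBand"])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.3f}, dof = {dof}")
```

With 3 provinces and 4 bands the test has (3 - 1) × (4 - 1) = 6 degrees of freedom, which keeps the statistic on an interpretable scale.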
#### 2. Risk Differences Between Postal Codes
Chi-squared test on PostalCode and TotalPremium:

Chi-squared statistic (χ²): 182589197.144\
p-value: 0.0\
Conclusion: "Reject the null hypothesis."\
Interpretation:

As in the provincial test, the chi-squared test here assesses whether the distribution of TotalPremium values differs between postal codes. The same caveat applies: TotalPremium is numeric, and PostalCode has many distinct values, so the statistic is expected to be very large.
With a p-value reported as 0.0, the differences in TotalPremium (used here as a proxy for risk) between postal codes are statistically significant.
Conclusion: The results suggest that some postal codes carry higher or lower risk than others, so ACIS might consider postal-code-based segmentation for more precise risk assessment and targeted marketing.
#### 3. Margin Differences Between Postal Codes
Z-test on TotalPremium:

Z-statistic: -4.256 \
p-value: 2.077e-05\
Conclusion: "Reject the null hypothesis."\
Interpretation:

The Z-test compares the means of two groups (here, two postal codes) to see whether they differ significantly. The output does not state which two postal codes were compared.
A Z-statistic of -4.256 means the observed difference in mean TotalPremium is more than four standard errors from zero.
The p-value of 2.08e-05 is far below the typical significance threshold of 0.05, leading to the rejection of the null hypothesis.
Conclusion: There is a statistically significant difference in mean TotalPremium between the two postal codes compared. Note that the test is labelled a margin test but was run on TotalPremium, so it does not measure margin directly. ACIS could potentially adjust pricing or focus marketing efforts differently in these postal codes.
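The output for this test reads "Z-test on TotalPremium" even though its title refers to margin. If margin is taken to mean premium minus claims, a direct comparison could look like the sketch below. This is an assumption-laden illustration: the `TotalClaims` column, the definition `Margin = TotalPremium - TotalClaims`, the postal codes, and all numbers are hypothetical, and Welch's t-test from SciPy stands in for whatever test the class actually runs.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Made-up data: both postal codes have identical premiums but different claims
demo = pd.DataFrame({
    "PostalCode": [1000] * 5 + [2000] * 5,
    "TotalPremium": [100, 120, 110, 130, 115, 100, 120, 110, 130, 115],
    "TotalClaims": [20, 30, 25, 35, 30, 60, 70, 65, 80, 75],  # hypothetical column
})

# Assumed definition of margin: premium collected minus claims paid
demo["Margin"] = demo["TotalPremium"] - demo["TotalClaims"]

group_a = demo.loc[demo["PostalCode"] == 1000, "Margin"]
group_b = demo.loc[demo["PostalCode"] == 2000, "Margin"]

# Welch's t-test: does not assume equal variances in the two groups
t_stat, p_value = ttest_ind(group_a, group_b, equal_var=False)
print(f"mean margin A = {group_a.mean()}, mean margin B = {group_b.mean()}")
print(f"t = {t_stat:.3f}, p-value = {p_value:.6f}")
```

In this toy example the premiums are identical across the two groups, so a test on TotalPremium alone would find nothing, while the margins clearly differ.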
#### 4. Risk Differences Between Women and Men
T-test on TotalPremium:

T-statistic: -0.828\
p-value: 0.408\
Conclusion: "Fail to reject the null hypothesis."\
Interpretation:

The T-test is used here to compare mean TotalPremium between men and women.
A T-statistic of -0.828 indicates that the observed difference in means is small relative to its standard error.
The p-value of 0.408 is larger than the 0.05 significance threshold, so we fail to reject the null hypothesis. This agrees with the earlier Z-test on the same groups (Z = -0.801, p = 0.423).
Conclusion: There is no statistically significant difference in mean TotalPremium (used here as a proxy for risk) between men and women. This sample provides no evidence that gender affects the total premium paid, although failing to reject the null hypothesis does not prove that no difference exists.
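A test statistic depends on sample size, so it does not by itself say how large a difference is. A standardized effect size such as Cohen's d is one common complement. The sketch below is a minimal version using the pooled standard deviation; the function name `cohens_d` and the demo values are illustrative assumptions, and nothing here was computed on the notebook's dataset.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pooled variance weights each sample variance by its degrees of freedom
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Made-up numbers, only to show the call
d_demo = cohens_d([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(round(d_demo, 4))
```

Reporting an effect size alongside the p-value helps separate "not significant" from "negligibly small".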

### Overall Insights:
Provinces and Postal Codes: The chi-squared tests for both provinces and postal codes show statistically significant differences in TotalPremium across locations. This suggests that geographic location is an important factor in insurance risk, and ACIS should consider location-based strategies for pricing and marketing.

Postal Code Margins: The Z-test shows a significant difference in mean TotalPremium between the two postal codes compared, which is consistent with postal codes having distinct risk profiles. Because the test was run on premiums, it does not measure margin directly.

Gender: The t-test shows no statistically significant difference in mean TotalPremium between men and women, so this analysis gives no support for gender-based pricing adjustments.