### 10 Academy: Artificial Intelligence Mastery
#### Week 3 Challenge
##### A/B_Hypothesis_Testing
Ethel Cherotaw

A/B hypothesis testing is a method for comparing two groups to determine whether there is a statistically significant difference between them.

In [1]:
import sys
import numpy as np
import os
import pandas as pd
from scipy.stats import ttest_ind, chi2_contingency
from scipy import stats

data_dir = r'E:\2017.Study\Tenx\Week-3\Data\data'
src_dir = r'E:\2017.Study\Tenx\Week-3\Insurance\W3.Insurance-Planning.AIM2\src'


sys.path.append(src_dir)
sys.path.append(data_dir)
from AB_utils import InsuranceDataUtils

#### Loading Data for testing 

In [2]:
csv_file_path = r'E:\2017.Study\Tenx\Week-3\Data\data\cleaned_data.csv'
df = pd.read_csv(csv_file_path, low_memory=False)
df.head()


Unnamed: 0,UnderwrittenCoverID,PolicyID,TransactionMonth,IsVATRegistered,Citizenship,LegalType,Title,Language,Bank,AccountType,...,ExcessSelected,CoverCategory,CoverType,CoverGroup,Section,Product,StatutoryClass,StatutoryRiskType,TotalPremium,TotalClaims
0,145249,12827,2015-03-01 00:00:00,True,,Close Corporation,Mr,English,First National Bank,Current account,...,Mobility - Windscreen,Windscreen,Windscreen,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,21.929825,0.0
1,145249,12827,2015-05-01 00:00:00,True,,Close Corporation,Mr,English,First National Bank,Current account,...,Mobility - Windscreen,Windscreen,Windscreen,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,21.929825,0.0
2,145249,12827,2015-07-01 00:00:00,True,,Close Corporation,Mr,English,First National Bank,Current account,...,Mobility - Windscreen,Windscreen,Windscreen,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,0.0,0.0
3,145255,12827,2015-05-01 00:00:00,True,,Close Corporation,Mr,English,First National Bank,Current account,...,Mobility - Metered Taxis - R2000,Own damage,Own Damage,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,512.84807,0.0
4,145255,12827,2015-07-01 00:00:00,True,,Close Corporation,Mr,English,First National Bank,Current account,...,Mobility - Metered Taxis - R2000,Own damage,Own Damage,Comprehensive - Taxi,Motor Comprehensive,Mobility Metered Taxis: Monthly,Commercial,IFRS Constant,0.0,0.0


##### Checking the list of provinces

In [3]:
provinces = df['Province'].unique()
provinces

array(['Gauteng', 'KwaZulu-Natal', 'Mpumalanga', 'Eastern Cape',
       'Western Cape', 'Limpopo', 'North West', 'Free State',
       'Northern Cape'], dtype=object)

##### A/B Hypothesis Testing 

Let's break down the dataset by selecting the columns relevant to A/B hypothesis testing, then proceed through each step in detail. We will analyze risk and profit in the insurance dataset using the following metrics:

1. Risk Across Provinces: Assess average and total claims by Province to evaluate risk differences.
2. Risk by Zip Code: Compare average and total claims by ZipCode to explore risk variations.
3. Profit Margin by Zip Code: Calculate profit margins (Total Premium - Total Claims) by ZipCode to compare profitability.
4. Risk by Gender: Analyze total claims by Gender to identify risk differences between male and female policyholders. 

Selecting these columns into a single subset keeps the analysis focused on the fields each test needs.
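
The metrics listed above reduce to a few grouped aggregations. The following is a minimal sketch on made-up toy data, not the project's actual helper code; the `toy` DataFrame and its values are hypothetical, and only the column names are taken from the dataset.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are invented for illustration).
toy = pd.DataFrame({
    "Province": ["Gauteng", "Gauteng", "Limpopo", "Limpopo"],
    "PostalCode": [1459, 1459, 700, 701],
    "TotalPremium": [100.0, 50.0, 80.0, 20.0],
    "TotalClaims": [0.0, 200.0, 0.0, 10.0],
})

# Metric 1: average and total claims per province.
claims_by_province = toy.groupby("Province")["TotalClaims"].agg(["mean", "sum"])

# Metric 3: margin = premium - claims, averaged per postal code.
toy["Margin"] = toy["TotalPremium"] - toy["TotalClaims"]
margin_by_postal_code = toy.groupby("PostalCode")["Margin"].mean()

print(claims_by_province)
print(margin_by_postal_code)
```

The same `groupby` pattern applies to `Gender` for the fourth metric.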
    


In [4]:
# The dataset has no 'ZipCode' column, so 'PostalCode' is used in its place
df_subset = df[['Province', 'PostalCode', 'Gender', 'TotalPremium', 'TotalClaims','SumInsured']]

# Display the first few rows to confirm the columns were selected correctly
df_subset.head()


Unnamed: 0,Province,PostalCode,Gender,TotalPremium,TotalClaims,SumInsured
0,Gauteng,1459,Not specified,21.929825,0.0,0.01
1,Gauteng,1459,Not specified,21.929825,0.0,0.01
2,Gauteng,1459,Not specified,0.0,0.0,0.01
3,Gauteng,1459,Not specified,512.84807,0.0,119300.0
4,Gauteng,1459,Not specified,0.0,0.0,119300.0


#### 1. Risk Differences Across Provinces
##### Metric Selection
I used Total Claims because the null hypothesis concerns risk differences across provinces. Total Claims measures the actual cost of claims on insurance policies, so it is the most direct metric for quantifying and comparing risk between regions. In contrast, Total Premium measures revenue rather than the cost of claims, although it can reflect risk-related pricing. Margin, the difference between Total Premium and Total Claims, indicates profitability but does not directly measure risk, because it is also affected by factors such as pricing strategy. Comparing Total Claims across provinces therefore captures both the frequency and the severity of claims.

##### Data Segmentation
I used stratified selection, categorizing provinces into two groups based on average Total Claims:

- High Risk: provinces with above-average Total Claims.
- Low Risk: provinces with below-average Total Claims.

##### Statistical Testing
I used a t-test because it is appropriate for comparing the mean Total Claims of two groups.

In [5]:
# Create an instance of InsuranceDataUtils
data_util = InsuranceDataUtils(df_subset)

# Categorize provinces into high- and low-risk groups
high_risk_provinces, low_risk_provinces = data_util.categorize_provinces('TotalClaims')
print(f"High Risk Provinces: {high_risk_provinces}")
print(f"Low Risk Provinces: {low_risk_provinces}")

# Perform the test
t_stat, p_val, interpretation = data_util.test_risk_differences('TotalClaims', high_risk_provinces, low_risk_provinces)

# Print summary
data_util.print_summary(t_stat, p_val, interpretation)

High Risk Provinces: ['Gauteng', 'KwaZulu-Natal', 'Western Cape']
Low Risk Provinces: ['Eastern Cape', 'Free State', 'Limpopo', 'Mpumalanga', 'North West', 'Northern Cape']
T-statistic: 7.1185
P-value: 0.0000
Interpretation: Reject the null hypothesis: There are significant differences in risk.


Answer: The null hypothesis is rejected (p < 0.0001, printed as 0.0000 after rounding). There are significant differences in risk between the two groups of provinces.
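
The source of `AB_utils` is not shown in this notebook, so the following is only a sketch of how a split by average claims followed by a two-sample t-test could be written. The function names `split_provinces_by_mean` and `compare_groups`, the synthetic data, and the use of Welch's variant (`equal_var=False`) are assumptions made for illustration, not the behaviour of `InsuranceDataUtils`.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def split_provinces_by_mean(df, metric):
    """Split provinces at the mean of the per-province averages (hypothetical helper)."""
    province_means = df.groupby("Province")[metric].mean()
    cutoff = province_means.mean()
    high = sorted(province_means[province_means > cutoff].index)
    low = sorted(province_means[province_means <= cutoff].index)
    return high, low

def compare_groups(df, metric, high, low, alpha=0.05):
    """Welch's t-test on the metric between the two province groups."""
    a = df.loc[df["Province"].isin(high), metric]
    b = df.loc[df["Province"].isin(low), metric]
    t_stat, p_val = ttest_ind(a, b, equal_var=False)
    return t_stat, p_val, bool(p_val < alpha)

# Synthetic data: province "A" has much higher claims than "B" and "C".
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Province": np.repeat(["A", "B", "C"], 200),
    "TotalClaims": np.concatenate([
        rng.normal(100, 5, 200),
        rng.normal(10, 5, 200),
        rng.normal(12, 5, 200),
    ]),
})

high, low = split_provinces_by_mean(toy, "TotalClaims")
t_stat, p_val, reject = compare_groups(toy, "TotalClaims", high, low)
print(high, low, round(t_stat, 2), reject)
```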

#### 2. There are no risk differences between zip codes
Metric Selection: I again use Total Claims to evaluate risk differences between zip codes (postal codes), because Total Claims represents the actual financial impact of claims and is therefore a direct measure of risk. Total Premium and Margin reflect revenue and profit rather than risk.

##### Data Segmentation

Postal codes are assigned to risk categories using the percentile method, which divides them by percentiles of claims (e.g., the top 25% as high risk).
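
As an illustration of percentile-based segmentation, the sketch below labels the top 25% of postal codes by total claims as high risk. The toy data and the `risk_category` labels are hypothetical, and this is not necessarily how `analysis.analyze` groups postal codes.

```python
import pandas as pd

# Toy data: one row per postal code (values invented for illustration).
toy = pd.DataFrame({
    "PostalCode": [1, 2, 3, 4, 5, 6, 7, 8],
    "TotalClaims": [0.0, 5.0, 10.0, 20.0, 40.0, 80.0, 160.0, 320.0],
})

# Total claims per postal code, then the 75th-percentile cutoff.
claims_by_code = toy.groupby("PostalCode")["TotalClaims"].sum()
cutoff = claims_by_code.quantile(0.75)

# Postal codes above the cutoff form the high-risk group.
risk_category = (claims_by_code > cutoff).map({True: "High", False: "Other"})
high_risk_codes = list(risk_category[risk_category == "High"].index)
print(cutoff, high_risk_codes)
```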

In [6]:
# Run the postal code risk analysis on the selected columns
analysis = InsuranceDataUtils(df_subset)
analysis.analyze(threshold=10)


Category Claims:
    PostalCodeCategory    TotalClaims
0                    1  307583.342105
1                    2   61885.298246
2                    4       0.000000
3                    5   82951.526316
4                    6    8628.596491
..                 ...            ...
847               9830       0.000000
848               9868       0.000000
849               9869    2236.842105
850               9870       0.000000
851              Other       0.000000

[852 rows x 2 columns]
Chi-Square Test Results:
{'Chi-Square Statistic': 1214174.4063742852, 'P-Value': 1.0, 'Degrees of Freedom': 1373514, 'Expected Frequencies': array([[5.34047663e-03, 5.34047663e-03, 5.34047663e-03, ...,
        5.34047663e-03, 5.34047663e-03, 5.34047663e-03],
       [1.48785419e-03, 1.48785419e-03, 1.48785419e-03, ...,
        1.48785419e-03, 1.48785419e-03, 1.48785419e-03],
       [7.69924547e-05, 7.69924547e-05, 7.69924547e-05, ...,
        7.69924547e-05, 7.69924547e-05, 7.69924547e-05],
       .

The Chi-Square statistic was 1,214,174.41 with 1,373,514 degrees of freedom. Under the null hypothesis a Chi-Square statistic is expected to be close to its degrees of freedom, so a statistic below that value does not indicate an unusual deviation between observed and expected frequencies, which is consistent with the p-value of 1.0. Because the p-value is far above the common significance threshold of 0.05, we fail to reject the null hypothesis: the test finds no statistically significant relationship between postal code categories and insurance claims. The very large number of degrees of freedom reflects the many postal code categories and distinct claim amounts in the contingency table. The expected frequencies were very small (about 0.005 or less in the output shown), far below the value of 5 usually recommended for a Chi-Square test, so the result should be interpreted with caution. The low number of claims per category and the high number of postal code categories weaken the test's ability to detect real differences.
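
The concern about small expected frequencies can be checked directly from the array that `scipy.stats.chi2_contingency` returns. The sketch below is an illustration under stated assumptions: it uses toy data and a hypothetical boolean column `HasClaim` (for example, `TotalClaims > 0`) instead of raw claim amounts, which is a different setup from the one used in the cell above.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: three postal code categories with a binary claim indicator.
toy = pd.DataFrame({
    "PostalCodeCategory": ["A"] * 40 + ["B"] * 40 + ["C"] * 40,
    "HasClaim": (
        [True] * 10 + [False] * 30
        + [True] * 12 + [False] * 28
        + [True] * 8 + [False] * 32
    ),
})

# Contingency table of category versus claim indicator.
table = pd.crosstab(toy["PostalCodeCategory"], toy["HasClaim"])
chi2, p_value, dof, expected = chi2_contingency(table)

# Share of cells whose expected count falls below the usual rule of thumb of 5.
share_small = (expected < 5).mean()
print(f"chi2={chi2:.4f}, p={p_value:.4f}, dof={dof}, share of small cells={share_small:.2f}")
```

A large share of cells below 5 suggests merging sparse categories before relying on the p-value.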

The Chi-Square test does not show significant differences in insurance claims between postal codes. Postal codes, as categorized here, do not explain the variation in claim amounts, and other factors may better account for insurance risk.

It is therefore worth exploring features beyond postal codes, such as demographic or behavioral data, or aggregating postal codes differently to gain more insight into claim patterns.


#### 3. There are no significant margin (profit) differences between zip codes


In [7]:
insurance = InsuranceDataUtils(df_subset)
insurance.analyze_margins_by_postal_code(threshold=10)

Category Margins: 
    PostalCodeCategory      Margin
0                    1   -6.468455
1                    2   -0.687882
2                    4  113.947737
3                    5 -145.725189
4                    6   30.980985
..                 ...         ...
847               9830  131.760895
848               9868  116.042377
849               9869   43.193631
850               9870   80.469292
851              Other   70.769225

[852 rows x 2 columns]
ANOVA Test Results:
{'F-Statistic': 0.9075757372814504, 'P-Value': 0.9745048985380578}
Interpretation: Fail to reject the null hypothesis: No significant differences in margin between postal codes.


The F-statistic measures the variance between categories relative to the variance within each category. Here it is 0.9076; a value close to 1 means the variation between postal code categories is about the same size as the variation within them. The p-value is 0.9745, far above the typical alpha level of 0.05, so the observed differences in average margins are consistent with random variation. We therefore fail to reject the null hypothesis: there is no statistically significant difference in margins across postal codes.
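
A one-way ANOVA of this kind can be sketched with `scipy.stats.f_oneway`. The example below uses synthetic data, and it assumes that `threshold` means "postal codes with fewer rows than this are pooled into an `Other` category"; that reading is a guess based on the `Other` row in the output and is not confirmed by the helper's source.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Synthetic data: three common postal codes and two rare ones.
rng = np.random.default_rng(42)
codes = [1000] * 50 + [2000] * 50 + [3000] * 50 + [4000] * 3 + [5000] * 2
toy = pd.DataFrame({
    "PostalCode": codes,
    "TotalPremium": rng.uniform(50, 150, size=len(codes)),
    "TotalClaims": rng.exponential(60, size=len(codes)),
})
toy["Margin"] = toy["TotalPremium"] - toy["TotalClaims"]

# Assumed meaning of threshold: pool postal codes with fewer than 10 rows.
threshold = 10
counts = toy["PostalCode"].value_counts()
rare = counts[counts < threshold].index
toy["PostalCodeCategory"] = toy["PostalCode"].astype(str).where(
    ~toy["PostalCode"].isin(rare), "Other"
)

# One array of margins per category, then the ANOVA itself.
groups = [g["Margin"].to_numpy() for _, g in toy.groupby("PostalCodeCategory")]
f_stat, p_value = f_oneway(*groups)
print(f"F={f_stat:.4f}, p={p_value:.4f}")
```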

#### 4. There are no significant risk differences between women and men
Metric Selection: Total Claims

Data Segmentation: The mean of TotalClaims is calculated for each gender.

Control Group: Male
Testing Group: Female


Statistical Testing: t-test

In [8]:
analysis = InsuranceDataUtils(df_subset)

analysis.preprocess_and_calculate()
analysis.perform_t_test_and_interpret()


Average Claims by Gender:
Gender
Female    37.046055
Male      32.620312
Name: TotalClaims, dtype: float64
T-Test Results:
{'T-Statistic': -0.296353891400699, 'P-Value': 0.7669656471629474}
Interpretation: Fail to reject the null hypothesis: No significant differences in risk between genders.


P-Value: The p-value of 0.77 is far above the common alpha level of 0.05.
T-Statistic: The t-statistic of -0.30 is close to zero, indicating that the difference between the two group means is small relative to the variability in claims.
Conclusion: Fail to reject the null hypothesis. The data do not show a statistically significant difference in average claims between female and male policyholders.

This means that based on the current data, the risk associated with claims does not significantly differ between genders.
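
The gender comparison can be sketched in a few lines. This is an illustration on synthetic data, not the code behind `preprocess_and_calculate` or `perform_t_test_and_interpret`; dropping the `Not specified` rows and using Welch's variant (`equal_var=False`) are assumptions made here.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Synthetic data: both genders drawn from the same claim distribution.
rng = np.random.default_rng(7)
toy = pd.DataFrame({
    "Gender": ["Male"] * 300 + ["Female"] * 300 + ["Not specified"] * 100,
    "TotalClaims": rng.exponential(scale=35.0, size=700),
})

# Keep only rows with a known gender (an assumption about the preprocessing).
known = toy[toy["Gender"].isin(["Male", "Female"])]
mean_claims = known.groupby("Gender")["TotalClaims"].mean()

# Welch's t-test: Female is the testing group, Male the control group.
female = known.loc[known["Gender"] == "Female", "TotalClaims"]
male = known.loc[known["Gender"] == "Male", "TotalClaims"]
t_stat, p_value = ttest_ind(female, male, equal_var=False)

decision = "Reject" if p_value < 0.05 else "Fail to reject"
print(mean_claims)
print(f"t={t_stat:.4f}, p={p_value:.4f}: {decision} the null hypothesis")
```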
