## Hypothesis Testing
A statistical hypothesis test is a formal procedure used to determine whether there is enough evidence in a sample of data to infer that a certain condition is true for the entire population. It helps in making data-driven decisions.

This notebook attempts to answer questions about the distributions of, and relationships between, different attributes in our dataset using hypothesis tests.

The approach involves the following steps:
- Define the null hypothesis (H₀) and the alternative hypothesis (H₁).
- Select a suitable statistical test along with the corresponding test statistic.
- Choose a significance level α (commonly set at 0.05).
- Compute the test statistic's value.
- Calculate the p-value: the probability, assuming H₀ is true, of observing a test statistic at least as extreme as the one computed.
- Compare the p-value with the predetermined significance level, and reject H₀ if the p-value is smaller than α.
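
The steps above can be sketched end to end with a two-sided permutation test for a difference in means. This is only an illustration using the standard library: the function name `permutation_test` and the two small samples are made up for the example and are not taken from the insurance dataset or from `scripts.hypothesis_testing`.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    # Test statistic: absolute difference between the two group means.
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under H0 the group labels are exchangeable, so reshuffle them.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up illustrative samples (hypothetical values).
group_a = [12.1, 14.3, 13.8, 15.2, 12.9, 14.0, 13.5, 14.8]
group_b = [16.0, 17.4, 15.9, 18.1, 16.7, 17.0, 16.2, 17.8]

alpha = 0.05  # significance level
statistic, p_value = permutation_test(group_a, group_b)
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"statistic={statistic:.3f}, p-value={p_value:.4f}, decision: {decision}")
```

A permutation test is used here only because it makes each step explicit without extra dependencies; the tests in the rest of this notebook (ANOVA, chi-square, t-test) follow the same pattern with different test statistics.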

In [1]:
import pandas as pd 
import numpy as np


In [2]:
import os
import sys

sys.path.append("../")

In [3]:
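# Import only the helpers used in this notebook instead of a wildcard import
from scripts.hypothesis_testing import (
    calculate_margin,
    calculate_risk,
    perform_statistical_test,
)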

In [4]:
# import the cleaned dataset
df = pd.read_csv("../data/cleaned_insurance_data.csv")


In [7]:
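def print_test_results(result, risks):
    """Print a test result dict, then the per-group summary passed as `risks`."""
    if 'error' in result:
        print(result['error'])
    else:
        print(f"Test type: {result['test_type']}")
        print(f"Statistic: {result['statistic']}")
        print(f"p-value: {result['p_value']}")
        print(result['interpretation'])
    # The summary is always labelled "Risks", even when margins are passed in.
    print(f"Risks:\n{risks}\n")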

### 1. Test for risk differences across provinces

- Null Hypothesis (H₀): There are no risk differences across provinces (in terms of TotalPremium).
- Alternative Hypothesis (H₁): There are risk differences across provinces.
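
What a one-way ANOVA computes can be sketched by hand: the F statistic is the ratio of between-group to within-group variance. The implementation of `perform_statistical_test` is not shown in this notebook, so the function `one_way_anova_f` below is a hypothetical stand-in and the three groups are made-up numbers. In practice, `scipy.stats.f_oneway` returns this statistic together with a p-value.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    k = len(groups)                         # number of groups
    n_total = sum(len(g) for g in groups)   # total number of observations
    grand_mean = mean([x for g in groups for x in g])
    group_means = [mean(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Made-up values standing in for TotalPremium in three provinces.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_anova_f(groups)
print(f"F statistic: {f_stat:.2f}")
```

A large F value means the group means differ by more than the spread within groups would suggest, which is the evidence against H₀.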

In [13]:
# Test for risk differences across provinces using an ANOVA test
print("1. Testing for risk differences across provinces")
province_risks = calculate_risk(df, 'Province', 'TotalPremium')
result = perform_statistical_test(df, 'Province', 'TotalPremium', 'anova')
print_test_results(result, province_risks)

1. Testing for risk differences across provinces
Test type: anova
Statistic: 308.5343902463087
p-value: 0.0
Reject the null hypothesis (p-value: 0.0000). There is a significant difference.
Risks:
Province
Eastern Cape     15.992497
Free State       18.286325
Gauteng          15.150832
KwaZulu-Natal    15.545857
Limpopo          17.632667
Mpumalanga       13.831484
North West       15.117804
Northern Cape    15.119254
Western Cape     12.857935
Name: TotalPremium, dtype: float64



In [10]:
# Test for risk differences across provinces using a chi-square test
print("1. Testing for risk differences across provinces")
province_risks = calculate_risk(df, 'Province', 'TotalPremium')
result = perform_statistical_test(df, 'Province', 'TotalPremium', 'chi_square')
print_test_results(result, province_risks)

1. Testing for risk differences across provinces
Test type: chi_square
Statistic: 1675877.9839662637
p-value: 0.0
Reject the null hypothesis (p-value: 0.0000). There is a significant difference.
Risks:
Province
Eastern Cape     15.992497
Free State       18.286325
Gauteng          15.150832
KwaZulu-Natal    15.545857
Limpopo          17.632667
Mpumalanga       13.831484
North West       15.117804
Northern Cape    15.119254
Western Cape     12.857935
Name: TotalPremium, dtype: float64
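
A chi-square test of independence operates on counts in a contingency table, so a continuous column such as TotalPremium generally has to be binned or otherwise converted to categories first. How `perform_statistical_test` builds its table is not shown here. The sketch below assumes a made-up 2×2 table of counts and a hypothetical helper `chi_square_statistic`; `scipy.stats.chi2_contingency` is the usual library routine for this.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Made-up counts: rows are two provinces, columns are policies
# with a high premium vs. a low premium (hypothetical binning).
table = [[30, 70], [50, 50]]
chi2_stat, dof = chi_square_statistic(table)

# 3.841 is the chi-square critical value for alpha = 0.05 with 1 degree of freedom.
critical_value = 3.841
reject_h0 = chi2_stat > critical_value
print(f"chi2={chi2_stat:.3f}, dof={dof}, reject H0: {reject_h0}")
```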



### 2. Test for risk differences between zip codes

- Null Hypothesis (H₀): There are no risk differences between zip codes (in terms of TotalPremium).
- Alternative Hypothesis (H₁): There are risk differences between zip codes.

In [11]:
# Test for risk differences between zip codes using an ANOVA test
print("2. Testing for risk differences between zipcodes")
zipcode_risks = calculate_risk(df, 'PostalCode', 'TotalPremium')
result = perform_statistical_test(df, 'PostalCode', 'TotalPremium', 'anova')
print_test_results(result, zipcode_risks.nlargest(5))

2. Testing for risk differences between zipcodes
Test type: anova
Statistic: 45.97102609099751
p-value: 0.0
Reject the null hypothesis (p-value: 0.0000). There is a significant difference.
Risks:
PostalCode
284     43.859649
322     43.859649
331     43.859649
1807    43.859649
2210    43.859649
Name: TotalPremium, dtype: float64



In [16]:
df['Margin'] = df['TotalPremium'] - df['TotalClaims']

In [14]:
# Test for risk differences between zip codes using a chi-square test
print("2. Testing for risk differences between zipcodes")
zipcode_risks = calculate_risk(df, 'PostalCode', 'TotalPremium')
result = perform_statistical_test(df, 'PostalCode', 'TotalPremium', 'chi_square')
print_test_results(result, zipcode_risks.nlargest(5))

2. Testing for risk differences between zipcodes
Test type: chi_square
Statistic: 143288895.37515512
p-value: 0.0
Reject the null hypothesis (p-value: 0.0000). There is a significant difference.
Risks:
PostalCode
284     43.859649
322     43.859649
331     43.859649
1807    43.859649
2210    43.859649
Name: TotalPremium, dtype: float64



### 3. Test for margin (profit) differences between zip codes

- Null Hypothesis (H₀): There is no significant margin (profit) difference between zip codes.
- Alternative Hypothesis (H₁): There is a significant margin (profit) difference between zip codes.

In [17]:
# Test for margin (profit) differences between zip codes using an ANOVA test
print("3. Testing for margin differences between zip codes")
zipcode_margins = calculate_margin(df, 'PostalCode')
result = perform_statistical_test(df, 'PostalCode', 'Margin', 'anova')
print_test_results(result, zipcode_margins.nlargest(5))

3. Testing for margin differences between zip codes
Test type: anova
Statistic: 45.97102609099751
p-value: 0.0
Reject the null hypothesis (p-value: 0.0000). There is a significant difference.
Risks:
PostalCode
2000    2.281240e+06
122     6.682290e+05
299     4.235489e+05
7784    3.321872e+05
2196    2.474567e+05
dtype: float64



In [18]:
# Test for margin (profit) differences between zip codes using a chi-square test
print("3. Testing for margin differences between zip codes")
zipcode_margins = calculate_margin(df, 'PostalCode')
result = perform_statistical_test(df, 'PostalCode', 'Margin', 'chi_square')
print_test_results(result, zipcode_margins.nlargest(5))

3. Testing for margin differences between zip codes
Test type: chi_square
Statistic: 143288895.37515512
p-value: 0.0
Reject the null hypothesis (p-value: 0.0000). There is a significant difference.
Risks:
PostalCode
2000    2.281240e+06
122     6.682290e+05
299     4.235489e+05
7784    3.321872e+05
2196    2.474567e+05
dtype: float64



### 4. Test for risk differences between Women and Men

- Null Hypothesis (H₀): There is no significant difference in risk between males and females (in terms of TotalPremium).
- Alternative Hypothesis (H₁): There is a significant difference in risk between males and females.
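
For two groups, the test statistic is the difference in means divided by its standard error. The sketch below computes Welch's version of the t statistic (unequal variances) on made-up samples. Whether `perform_statistical_test` uses Welch's or Student's variant is not shown in this notebook, so treat `welch_t` as a hypothetical stand-in. The p-value uses a normal approximation to stay dependency-free, which is reasonable only for large samples; `scipy.stats.ttest_ind(a, b, equal_var=False)` gives the exact t-based p-value.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic with a two-sided, normal-approximation p-value."""
    # Standard error of the difference in means, without assuming equal variances.
    se = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    t_stat = (mean(sample_a) - mean(sample_b)) / se
    # Large-sample approximation: treat the statistic as standard normal.
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value


# Made-up values standing in for TotalPremium of two groups.
female = [1.0, 2.0, 3.0, 4.0, 5.0]
male = [2.0, 3.0, 4.0, 5.0, 6.0]

t_stat, p_value = welch_t(female, male)
print(f"t={t_stat:.3f}, approximate p-value={p_value:.4f}")
```

With a very large statistic the p-value can be too small to show at four decimal places, which is why a formatted p-value may display as 0.0000 even though the underlying value is nonzero.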

In [19]:
# 4. Test for risk differences between Women and Men
print("4. Testing for risk differences between Women and Men")

filtered_df = df[df['Gender'].isin(['Male', 'Female'])]

gender_risks = calculate_risk(filtered_df, 'Gender', 'TotalPremium')
result = perform_statistical_test(filtered_df, 'Gender', 'TotalPremium', 't_test')
print_test_results(result, gender_risks)

4. Testing for risk differences between Women and Men
Test type: t_test
Statistic: 7.670958551805413
p-value: 1.7086864636026e-14
Reject the null hypothesis (p-value: 0.0000). There is a significant difference.
Risks:
Gender
Female    14.230912
Male      14.910986
Name: TotalPremium, dtype: float64

