# A/B Hypothesis Testing for the Insurance Data

In [19]:
import os
print(os.getcwd())  # This prints the current working directory
os.chdir(r'c:\users\ermias.tadesse\10x\Claim-And-Risk-Analytics-for-Insurance')  # Set the working directory to the project root
# Import necessary libraries
import pandas as pd
from scripts.AB_Hypothesis_Testing import ABHypothesisTesting

c:\users\ermias.tadesse\10x\Claim-And-Risk-Analytics-for-Insurance\data


# Load the CSV file and create an ABHypothesisTesting instance

In [20]:
# Set the working directory to the folder containing your data
os.chdir('c:\\users\\ermias.tadesse\\10x\\Claim-And-Risk-Analytics-for-Insurance\\data')
# Load your dataset
data = pd.read_csv('Final_Task-1_date.csv', low_memory=False)

# Create an instance of the ABHypothesisTesting class
ab_test = ABHypothesisTesting(data)

# Data Preparation
### Relevant columns for A/B Hypothesis Testing

In [21]:
# Relevant columns for A/B Hypothesis Testing
relevant_columns = {
    "risk_differences_across_provinces": ["TotalClaims", "Province"],
    "risk_differences_between_zip_codes": ["TotalClaims", "MainCrestaZone"],
    "profit_differences_between_zip_codes": ["TotalPremium", "TotalClaims", "MainCrestaZone"],
    "risk_differences_between_genders": ["TotalClaims", "Gender"]
}

### Flatten the list of relevant columns and create a DataFrame with only the relevant columns

In [22]:

flattened_columns = sum(relevant_columns.values(), [])

df_relevant_cols = data[flattened_columns]
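
Because `TotalClaims` and `MainCrestaZone` are used by more than one test, the flattened list contains repeated names, and selecting with it yields repeated columns. If one column per name is enough, an order-preserving de-duplication could look like the sketch below. The name `unique_columns` is illustrative and not part of the original notebook.

```python
from itertools import chain

relevant_columns = {
    "risk_differences_across_provinces": ["TotalClaims", "Province"],
    "risk_differences_between_zip_codes": ["TotalClaims", "MainCrestaZone"],
    "profit_differences_between_zip_codes": ["TotalPremium", "TotalClaims", "MainCrestaZone"],
    "risk_differences_between_genders": ["TotalClaims", "Gender"],
}

# Flatten the per-test column lists into a single list (duplicates included)
flattened = list(chain.from_iterable(relevant_columns.values()))

# dict.fromkeys keeps the first occurrence of each name and preserves order
unique_columns = list(dict.fromkeys(flattened))

print(unique_columns)
# ['TotalClaims', 'Province', 'MainCrestaZone', 'TotalPremium', 'Gender']
```

Selecting `data[unique_columns]` would then give one row per column in the missing-data summary instead of repeated entries.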

### Check for columns with missing values and print the summary of missing data

In [23]:
missing_data_summary = ab_test.check_missing_data(df_relevant_cols)

print(missing_data_summary)

      Column Name  Missing Values  Percentage Missing
0     TotalClaims               0            0.000000
1        Province               0            0.000000
2     TotalClaims               0            0.000000
3  MainCrestaZone               0            0.000000
4    TotalPremium               0            0.000000
5     TotalClaims               0            0.000000
6  MainCrestaZone               0            0.000000
7     TotalClaims               0            0.000000
8          Gender           43490            4.348574
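
The implementation of `check_missing_data` is not shown in this notebook. As a rough sketch only, a summary with the same three columns could be produced as follows. The function name `summarize_missing` and the toy data are hypothetical.

```python
import pandas as pd


def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages (hypothetical helper)."""
    missing = df.isnull().sum()
    return pd.DataFrame({
        "Column Name": missing.index,
        "Missing Values": missing.to_numpy(),
        "Percentage Missing": missing.to_numpy() / len(df) * 100,
    })


# Toy data for illustration only, not taken from the insurance dataset
toy = pd.DataFrame({
    "TotalClaims": [0.0, 120.5, 0.0, 30.0],
    "Gender": ["Male", None, "Female", None],
})

summary = summarize_missing(toy)
print(summary)
```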


## Running the A/B Hypothesis Tests

#### I use MainCrestaZone instead of the postal code because MainCrestaZone has fewer unique values.

In [24]:
# Display results in a more readable format
for test_name, columns in relevant_columns.items():
    print(f"\n--- Running Test: {test_name} ---")
    
    # Extract relevant columns from the data
    df_subset = data[columns]
    
    # Check for missing data in the subset
    missing_data = ab_test.check_missing_data(df_subset)
    print(f"Missing data for {test_name}:\n{missing_data}\n")
    
    # If any column in the subset has missing values, drop the affected rows
    if (missing_data["Missing Values"] > 0).any():
        # Drop the 43,490 rows with missing Gender: the values cannot be derived
        # from Title, and fewer than 5% of rows are affected.
        df_subset = df_subset.dropna()
    
    # Running the hypothesis test
    test_results = ab_test.run_test(df_subset)

    # Displaying results in a more readable way
    print(f"Results for {test_name}:")
    for key, value in test_results.items():
        print(f"    {key}: {value}")



--- Running Test: risk_differences_across_provinces ---
Missing data for risk_differences_across_provinces:
   Column Name  Missing Values  Percentage Missing
0  TotalClaims               0                 0.0
1     Province               0                 0.0



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_subset.loc[:, 'RiskyClaims'] = df_subset[cols[0]] > threshold


Results for risk_differences_across_provinces:
    test_type: Chi-Square Test
    chi2_stat: 127.16463311089171
    p_value: 1.0940970989163865e-23
    dof: 8

--- Running Test: risk_differences_between_zip_codes ---
Missing data for risk_differences_between_zip_codes:
      Column Name  Missing Values  Percentage Missing
0     TotalClaims               0                 0.0
1  MainCrestaZone               0                 0.0





Results for risk_differences_between_zip_codes:
    test_type: Chi-Square Test
    chi2_stat: 181.3800123969581
    p_value: 1.2524051198466669e-30
    dof: 15

--- Running Test: profit_differences_between_zip_codes ---
Missing data for profit_differences_between_zip_codes:
      Column Name  Missing Values  Percentage Missing
0    TotalPremium               0                 0.0
1     TotalClaims               0                 0.0
2  MainCrestaZone               0                 0.0

Results for profit_differences_between_zip_codes:
    test_type: ANOVA
    f_stat: 338.19822067374315
    p_value: 0.0

--- Running Test: risk_differences_between_genders ---
Missing data for risk_differences_between_genders:
   Column Name  Missing Values  Percentage Missing
0  TotalClaims               0            0.000000
1       Gender           43490            4.348574





Results for risk_differences_between_genders:
    test_type: Chi-Square Test
    chi2_stat: 2.659589717224329
    p_value: 0.10292728070140099
    dof: 1
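
The reported p-values can be turned into reject / fail-to-reject decisions with a simple comparison. The sketch below assumes a significance level of 0.05 and copies the p-values from the output above. The names `ALPHA` and `decisions` are illustrative.

```python
ALPHA = 0.05  # assumed significance level

# p-values copied from the test output above
reported_p_values = {
    "risk_differences_across_provinces": 1.0940970989163865e-23,
    "risk_differences_between_zip_codes": 1.2524051198466669e-30,
    "profit_differences_between_zip_codes": 0.0,
    "risk_differences_between_genders": 0.10292728070140099,
}

# Reject H0 when the p-value falls below the significance level
decisions = {
    name: "reject H0" if p < ALPHA else "fail to reject H0"
    for name, p in reported_p_values.items()
}

for name, decision in decisions.items():
    print(f"{name}: {decision}")
```

Under this rule, the first three tests reject the null hypothesis and only the gender test fails to reject it.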


### Documentation for A/B Hypothesis Testing Results

#### A/B Hypothesis Testing
In this task, I aimed to evaluate several null hypotheses related to risk and profit differences across various features in an insurance dataset. Below is the documentation summarizing the hypothesis tests, results, and conclusions.

---

### Null Hypotheses:

1. **Risk Differences Across Provinces:**
   - **Null Hypothesis (H₀):** There are no risk differences across provinces.
   - **Test Type:** Chi-Square Test
   - **Result:**
     - **Chi-Square Statistic (χ²):** 127.16
     - **Degrees of Freedom (dof):** 8
     - **p-value:** 1.09e-23
   - **Conclusion:** 
     - Since the p-value is less than 0.05, we reject the null hypothesis. This indicates that there **are significant risk differences** across provinces.

2. **Risk Differences Between Zip Codes:**
   - **Null Hypothesis (H₀):** There are no risk differences between zip codes (MainCrestaZone is used in place of the postal code).
   - **Test Type:** Chi-Square Test
   - **Result:**
     - **Chi-Square Statistic (χ²):** 181.38
     - **Degrees of Freedom (dof):** 15
     - **p-value:** 1.25e-30
   - **Conclusion:**
     - Since the p-value is less than 0.05, we reject the null hypothesis. This indicates that there **are significant risk differences** between zip codes.

3. **Profit Differences Between Zip Codes:**
   - **Null Hypothesis (H₀):** There are no significant margin (profit) differences between zip codes.
   - **Test Type:** ANOVA
   - **Result:**
     - **F-statistic:** 338.20
     - **p-value:** 0.0 (as reported; the value is too small to be represented in floating point)
   - **Conclusion:**
     - The p-value is less than 0.05, so we reject the null hypothesis. This means there **are significant profit differences** between zip codes.

4. **Risk Differences Between Women and Men:**
   - **Null Hypothesis (H₀):** There are no significant risk differences between women and men.
   - **Test Type:** Chi-Square Test
   - **Result:**
     - **Chi-Square Statistic (χ²):** 2.66
     - **Degrees of Freedom (dof):** 1
     - **p-value:** 0.103
   - **Conclusion:**
     - The p-value is greater than 0.05, so we fail to reject the null hypothesis. This suggests there **are no significant risk differences** between genders.
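
As a sanity check on the gender result, the p-value of a chi-square statistic with one degree of freedom can be recomputed with the standard library alone, because a chi-square variable with 1 dof is the square of a standard normal variable. The helper name `chi2_sf_one_dof` is illustrative; for other degrees of freedom a library routine such as `scipy.stats.chi2.sf` would be needed.

```python
import math


def chi2_sf_one_dof(stat: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # P(Z^2 > stat) = P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2)) for standard normal Z
    return math.erfc(math.sqrt(stat / 2.0))


# Statistic reported for the gender test above
p_gender = chi2_sf_one_dof(2.659589717224329)
print(round(p_gender, 3))  # 0.103
```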

---

### Methodology:

1. **Data Segmentation:** 
   The dataset was divided into groups (e.g., provinces, zip codes, and genders) to create control and test groups for hypothesis testing.

2. **Metrics:**
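   - **Risk Differences:** Total claims were used as the metric to measure risk. For the Chi-Square tests they were converted into a binary `RiskyClaims` flag (TotalClaims above a threshold).
   - **Profit Differences:** Total premium, together with total claims, was used as the metric to assess profit margins.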

3. **Statistical Testing:**
   - For **risk comparisons across categorical groups** (province, MainCrestaZone, and gender), a **Chi-Square test** was conducted to check whether risk is independent of the group.
   - For the **numerical profit comparison** between zip codes, based on TotalPremium and TotalClaims, an **ANOVA test** was used.
   - A **p-value** threshold of 0.05 was used to determine statistical significance.
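
The two statistics described above can be sketched by hand to show what each test measures. This is not the code in `ABHypothesisTesting`, which is not shown here. The function names, the threshold of 0, the definition of margin as premium minus claims, and the toy data are all assumptions for illustration. The chi-square version applies no continuity correction.

```python
from collections import Counter
from statistics import mean


def chi_square_stat(groups, risky_flags):
    """Pearson chi-square statistic and dof for a group x risky-flag table."""
    observed = Counter(zip(groups, risky_flags))
    row_totals = Counter(groups)
    col_totals = Counter(risky_flags)
    n = len(groups)
    stat = 0.0
    for group, row_total in row_totals.items():
        for flag, col_total in col_totals.items():
            # Expected count if risk were independent of the group
            expected = row_total * col_total / n
            stat += (observed[(group, flag)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


def anova_f_stat(samples_by_group):
    """One-way ANOVA F statistic: between-group over within-group variance."""
    all_values = [v for values in samples_by_group.values() for v in values]
    grand_mean = mean(all_values)
    ss_between = 0.0
    ss_within = 0.0
    for values in samples_by_group.values():
        group_mean = mean(values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in values)
    k = len(samples_by_group)
    n = len(all_values)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Toy risk data: flag a policy as risky when its claims exceed the threshold
threshold = 0  # hypothetical threshold
provinces = ["A", "A", "A", "A", "B", "B", "B", "B"]
claims = [0, 0, 0, 50, 10, 20, 0, 30]
risky = [c > threshold for c in claims]
chi2_stat, dof = chi_square_stat(provinces, risky)

# Toy margins (premium minus claims, an assumed definition) per zone
margins = {
    "Zone 1": [1.0, 2.0, 3.0],
    "Zone 2": [2.0, 3.0, 4.0],
    "Zone 3": [6.0, 7.0, 8.0],
}
f_stat = anova_f_stat(margins)

print(chi2_stat, dof, f_stat)
```

In practice the p-values would come from the chi-square and F distributions, for example via `scipy.stats.chi2_contingency` and `scipy.stats.f_oneway`.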

---

### Notes:
- The analysis identified significant differences in risk across provinces and between zip codes, and in profit margins between zip codes. No significant risk difference was found between genders.
- A **SettingWithCopyWarning** was encountered during the process, indicating that a copy of the data was being modified. This does not affect the results but should be addressed by refactoring code to avoid chained assignment.
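
One common way to avoid this warning is to take an explicit copy of the column subset before adding a new column to it. The sketch below uses toy data and a hypothetical `threshold`; the `RiskyClaims` column name is taken from the warning message.

```python
import pandas as pd

# Toy data for illustration only
data = pd.DataFrame({
    "TotalClaims": [0.0, 250.0, 0.0, 90.0],
    "Province": ["Gauteng", "Gauteng", "Western Cape", "Western Cape"],
})

threshold = 0  # hypothetical threshold

# .copy() makes df_subset an independent DataFrame, so assigning a new
# column does not touch a slice of `data`
df_subset = data[["TotalClaims", "Province"]].copy()
df_subset["RiskyClaims"] = df_subset["TotalClaims"] > threshold

print(df_subset)
```

The original `data` frame is left unchanged, and the new column lives only on the copy.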

---

### Conclusion:
This analysis provides actionable insights into how risk and profit margins vary by province, zip code (MainCrestaZone), and gender. The significant findings on risk differences across provinces and zip codes, and on zip code-based profit margins, can inform strategic decisions regarding pricing, risk assessment, and customer targeting.

