# Task 3: A/B Hypothesis Testing

### Phase 1: Define the Metrics

**Objective:** Choose KPIs to measure the impact of the features.

- **Risk Differences Across Provinces:** Average Risk Score per Province
- **Risk Differences Between Zip Codes:** Average Risk Score per Zip Code
- **Margin Difference Between Zip Codes:** Average Profit Margin per Zip Code
- **Risk Differences Between Women and Men:** Average Risk Score by Gender

### Phase 2: Data Segmentation

**Objective:** Divide data into control and test groups.

- **Group A (Control):** Without the feature or baseline feature.
- **Group B (Test):** With the feature or variant feature.
- **For Multi-Class Features:** Ensure groups are comparable except for the feature tested.

### Phase 3: Statistical Testing

**Objective:** Conduct tests to evaluate feature impact.

- **Categorical Data:** Chi-Squared Test
- **Numerical Data:** T-Tests or Z-Tests

### Phase 4: Analyze the p-Value

**Objective:** Determine whether the null hypothesis should be rejected.

- **Reject the Null Hypothesis:** If p_value < 0.05
- **Fail to Reject the Null Hypothesis:** If p_value >= 0.05
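
The decision rule above can be written as a small helper. This is a minimal sketch; the function name `decide` and its `alpha` parameter are illustrative and do not appear elsewhere in this notebook.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the test decision at significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    # Strictly below alpha rejects; a p-value equal to alpha does not.
    return "reject H0" if p_value < alpha else "fail to reject H0"


print(decide(0.03))
print(decide(0.05))
```

Keeping the threshold as a parameter makes it easy to rerun the same decisions at a stricter level such as 0.01.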

### Phase 5: Analyze and Report

**Objective:** Document and interpret findings.

- **Summarize Results:** Document test results and p-values.
- **Interpret Results:** Explain findings relative to business goals.
- **Report Findings:** Create a report with visualizations and explanations.


### Phase 1: Define the Metrics

In [9]:
import pandas as pd
df=pd.read_csv('dataset.csv')


In [11]:
print(df.columns.tolist())

['UnderwrittenCoverID', 'PolicyID', 'TransactionMonth', 'IsVATRegistered', 'Citizenship', 'LegalType', 'Title', 'Language', 'Bank', 'AccountType', 'MaritalStatus', 'Gender', 'Country', 'Province', 'PostalCode', 'MainCrestaZone', 'SubCrestaZone', 'ItemType', 'mmcode', 'VehicleType', 'RegistrationYear', 'make', 'Model', 'Cylinders', 'cubiccapacity', 'kilowatts', 'bodytype', 'NumberOfDoors', 'VehicleIntroDate', 'CustomValueEstimate', 'AlarmImmobiliser', 'TrackingDevice', 'CapitalOutstanding', 'NewVehicle', 'WrittenOff', 'Rebuilt', 'Converted', 'CrossBorder', 'NumberOfVehiclesInFleet', 'SumInsured', 'TermFrequency', 'CalculatedPremiumPerTerm', 'ExcessSelected', 'CoverCategory', 'CoverType', 'CoverGroup', 'Section', 'Product', 'StatutoryClass', 'StatutoryRiskType', 'TotalPremium', 'TotalClaims']


In [12]:
import pandas as pd

# Calculate Average Risk Score per Province
# Assuming 'TotalClaims' represents the risk score
avg_risk_province = df.groupby('Province')['TotalClaims'].mean()

# Calculate Average Risk Score per Zip Code
avg_risk_zipcode = df.groupby('PostalCode')['TotalClaims'].mean()

# Calculate Average Profit Margin per Zip Code
# Assumption: 'CalculatedPremiumPerTerm' is used here as a proxy for profit margin
# (it is a premium amount, not premium minus claims)
avg_profit_margin_zipcode = df.groupby('PostalCode')['CalculatedPremiumPerTerm'].mean()

# Calculate Average Risk Score by Gender
avg_risk_gender = df.groupby('Gender')['TotalClaims'].mean()

# Print the KPIs
print("Average Risk Score per Province:")
print(avg_risk_province)
print("\nAverage Risk Score per Zip Code:")
print(avg_risk_zipcode)
print("\nAverage Profit Margin per Zip Code:")
print(avg_profit_margin_zipcode)
print("\nAverage Risk Score by Gender:")
print(avg_risk_gender)


Average Risk Score per Province:
Province
Eastern Cape     2.564573
Free State       1.946287
Gauteng          2.810492
KwaZulu-Natal    2.059208
Limpopo          4.081468
Mpumalanga       4.345000
North West       2.676528
Northern Cape    2.230747
Western Cape     1.462408
Name: TotalClaims, dtype: float64

Average Risk Score per Zip Code:
PostalCode
1        2.169499
2        0.000000
4        0.000000
5        0.000000
6        2.325890
          ...    
9781    17.358257
9830     0.000000
9868     0.000000
9869     2.289501
9870     0.000000
Name: TotalClaims, Length: 842, dtype: float64

Average Profit Margin per Zip Code:
PostalCode
1       27.266745
2       30.791998
4       19.938561
5       17.842110
6       22.474850
          ...    
9781    26.059548
9830     9.298520
9868     8.685740
9869    29.772178
9870    14.882550
Name: CalculatedPremiumPerTerm, Length: 842, dtype: float64

Average Risk Score by Gender:
Gender
0                0.000000
Female           6.675169
Male
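
The KPI cell above uses `CalculatedPremiumPerTerm` as a stand-in for profit margin. One alternative, sketched below on invented toy data, is to define margin as premium minus claims and to report a loss ratio next to it. The `Margin` column, the `summary` table, and the toy values are assumptions for illustration; only the `TotalPremium`, `TotalClaims`, and `Province` column names come from the dataset.

```python
import pandas as pd

# Toy data, invented for illustration; column names mirror the dataset's.
toy = pd.DataFrame({
    "Province": ["Gauteng", "Gauteng", "Limpopo", "Limpopo"],
    "TotalPremium": [100.0, 200.0, 50.0, 150.0],
    "TotalClaims": [0.0, 150.0, 0.0, 300.0],
})

# Assumption: margin = premium collected minus claims paid, per row.
toy["Margin"] = toy["TotalPremium"] - toy["TotalClaims"]

summary = toy.groupby("Province").agg(
    total_premium=("TotalPremium", "sum"),
    total_claims=("TotalClaims", "sum"),
    avg_margin=("Margin", "mean"),
)

# Loss ratio: claims paid per unit of premium collected.
summary["loss_ratio"] = summary["total_claims"] / summary["total_premium"]
print(summary)
```

A loss ratio above 1 means a group paid out more in claims than it collected in premium, which the average premium alone cannot show.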

### Phase 2: Data Segmentation

In [15]:
import pandas as pd


def segment_data(df, feature, test_categories):
    """
    Segments the data into control and test groups based on the provided feature.
    
    Args:
    - df (pd.DataFrame): The DataFrame containing the data.
    - feature (str): The feature column to segment the data on.
    - test_categories (list): The categories to use for the test group.
    
    Returns:
    - control_df (pd.DataFrame): The DataFrame for the control group.
    - test_df (pd.DataFrame): The DataFrame for the test group.
    """
    # Group B (Test): With the feature or variant feature
    test_df = df[df[feature].isin(test_categories)]
    
    # Group A (Control): Without the feature or baseline feature
    control_df = df[~df[feature].isin(test_categories)]
    
    return control_df, test_df

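# Example 1: Segment by Gender
# Test Group: Female
# Control Group: every row whose Gender is not 'Female'. segment_data takes the
# complement of the test categories, so the control group includes 'Male' and
# any other or unspecified values. control_gender is listed for reference only
# and is not passed to segment_data.
control_gender = ['Male']
test_gender = ['Female']

control_df_gender, test_df_gender = segment_data(df, 'Gender', test_gender)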

# Example 2: Segment by Province (Coastal vs Inland)
# Coastal Provinces
coastal_provinces = ['Eastern Cape', 'KwaZulu-Natal', 'Western Cape']

# Inland Provinces
inland_provinces = ['Free State', 'Gauteng', 'Limpopo', 'Mpumalanga', 'North West', 'Northern Cape']

# Segmentation
control_df_provinces, test_df_provinces = segment_data(df, 'Province', coastal_provinces)  # Coastal for Test
# Alternatively, use inland_provinces for Test group and coastal_provinces for Control

# Example 3: Segment by PostalCode (Zip Codes)
# Use as many zip codes as possible for a meaningful comparison
# Using a subset of zip codes for demonstration (update as needed)
zip_code_counts = df['PostalCode'].value_counts()
total_zip_codes = zip_code_counts.index.tolist()

# For a balanced comparison, select a subset of zip codes
# Here, we will use the top 100 most frequent zip codes as an example
num_zip_codes = 100
top_zip_codes = zip_code_counts.head(num_zip_codes).index.tolist()

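# Split the top zip codes into two halves
halfway_point = len(top_zip_codes) // 2
control_zip_codes = top_zip_codes[:halfway_point]  # First half (not passed to segment_data)
test_zip_codes = top_zip_codes[halfway_point:]     # Second half

# Note: segment_data builds the control group from every row outside test_zip_codes,
# so it also contains zip codes that are not among the top 100
control_df_zip, test_df_zip = segment_data(df, 'PostalCode', test_zip_codes)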

# Print shapes of the segmented DataFrames
print("Control Group (Gender) Shape:", control_df_gender.shape)
print("Test Group (Gender) Shape:", test_df_gender.shape)
print("Control Group (Provinces) Shape:", control_df_provinces.shape)
print("Test Group (Provinces) Shape:", test_df_provinces.shape)
print("Control Group (Zip Codes) Shape:", control_df_zip.shape)
print("Test Group (Zip Codes) Shape:", test_df_zip.shape)


Control Group (Gender) Shape: (583144, 52)
Test Group (Gender) Shape: (3056, 52)
Control Group (Provinces) Shape: (377071, 52)
Test Group (Provinces) Shape: (209129, 52)
Control Group (Zip Codes) Shape: (511156, 52)
Test Group (Zip Codes) Shape: (75044, 52)
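
The shapes above show that each control group is simply the complement of its test group, so the gender control group contains more than the `Male` rows. If a strict two-category comparison is wanted, one option is to pass the control categories explicitly. The sketch below uses a hypothetical helper, `segment_explicit`, and invented toy rows; it is not the segmentation used for the results in this notebook.

```python
import pandas as pd


def segment_explicit(df, feature, control_categories, test_categories):
    """Split df into control and test groups using explicit category lists.

    Rows whose value is in neither list are dropped from both groups.
    """
    overlap = set(control_categories) & set(test_categories)
    if overlap:
        raise ValueError(f"Categories in both groups: {sorted(overlap)}")
    control_df = df[df[feature].isin(control_categories)]
    test_df = df[df[feature].isin(test_categories)]
    return control_df, test_df


# Toy data, invented for illustration.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Not specified", "Male", "Female"],
    "TotalClaims": [0.0, 10.0, 5.0, 20.0, 0.0],
})

control, test = segment_explicit(toy, "Gender", ["Male"], ["Female"])
print(control.shape, test.shape)
```

With this variant the "Not specified" row is excluded from both groups instead of falling into the control group.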


### Phase 3: Statistical Testing

**Objective:** Conduct tests to evaluate the impact of the features on the key performance indicators.

- **Chi-Squared Test for Categorical Data:**
  - **Groups:** Gender (Female as the test group, all other rows as the control group)
  - **Purpose:** To determine whether the distribution of `TotalClaims` values differs between the two gender groups.
  - **Code:** The `chi_square_test` function counts the values of the tested column in each group, stacks the two sets of counts into a contingency table, and performs a Chi-Squared test.

- **T-Test / Z-Test for Numerical Data:**
  - **Feature 1:** Risk Score (`TotalClaims`) across Provinces and across Zip Codes
  - **Feature 2:** Margin Difference (`CalculatedPremiumPerTerm`) between Zip Codes
  - **Purpose:** To determine whether the mean of a numerical metric (risk score or margin) differs significantly between the control and test groups.
  - **Code:** The `t_test` function compares the group means. It uses a Z-test approximation when both groups have more than 30 rows and falls back to Welch's T-test otherwise.

**Results Interpretation:**
- For the Chi-Squared Test, a p-value < 0.05 indicates a significant difference in the distribution of `TotalClaims` between the gender groups.
- For the T-Test / Z-Test, a p-value < 0.05 indicates a significant difference in the mean risk score or margin between the control and test groups.
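
To make the large-sample comparison of means concrete, here is a minimal standard-library sketch of a two-sample Z-test with unpooled variances. The function name `two_sample_z_test` and the toy samples are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sample_z_test(a, b):
    """Two-sided Z-test for a difference in means, using unpooled sample variances."""
    # Standard error of the difference between the two sample means.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    # Two-sided p-value from the standard normal distribution.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Toy samples, invented for illustration: group_b is group_a shifted up by 1.
group_a = [float(i % 5) for i in range(50)]
group_b = [float(i % 5) + 1.0 for i in range(50)]

z, p = two_sample_z_test(group_a, group_b)
print(f"z = {z:.3f}, p = {p:.6f}")
```

The sign of `z` follows the order of the arguments: it is negative here because the first group has the lower mean.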


In [19]:
import pandas as pd
from scipy import stats
from scipy.stats import chi2_contingency

# Phase 2: Data Segmentation (repeated unchanged from the previous cell)

def segment_data(df, feature, test_categories):
    """
    Segments the data into control and test groups based on the provided feature.
    
    Args:
    - df (pd.DataFrame): The DataFrame containing the data.
    - feature (str): The feature column to segment the data on.
    - test_categories (list): The categories to use for the test group.
    
    Returns:
    - control_df (pd.DataFrame): The DataFrame for the control group.
    - test_df (pd.DataFrame): The DataFrame for the test group.
    """
    test_df = df[df[feature].isin(test_categories)]
    control_df = df[~df[feature].isin(test_categories)]
    return control_df, test_df

# Segment by Gender
control_gender = ['Male']
test_gender = ['Female']
control_df_gender, test_df_gender = segment_data(df, 'Gender', test_gender)

# Segment by Province (Coastal vs Inland)
coastal_provinces = ['Eastern Cape', 'KwaZulu-Natal', 'Western Cape']
inland_provinces = ['Free State', 'Gauteng', 'Limpopo', 'Mpumalanga', 'North West', 'Northern Cape']
control_df_provinces, test_df_provinces = segment_data(df, 'Province', coastal_provinces)

# Segment by PostalCode (Zip Codes)
zip_code_counts = df['PostalCode'].value_counts()
top_zip_codes = zip_code_counts.head(100).index.tolist()
halfway_point = len(top_zip_codes) // 2
control_zip_codes = top_zip_codes[:halfway_point]  # Control group (first half of zip codes)
test_zip_codes = top_zip_codes[halfway_point:]     # Test group (second half of zip codes)
control_df_zip, test_df_zip = segment_data(df, 'PostalCode', test_zip_codes)

# Phase 3: Statistical Testing

# Function to perform Chi-squared test for categorical variables
def chi_square_test(control_df, test_df, categorical_feature):
    """
    Performs a Chi-squared test for categorical data.
    
    Args:
    - control_df (pd.DataFrame): Control group DataFrame.
    - test_df (pd.DataFrame): Test group DataFrame.
    - categorical_feature (str): The categorical feature to test.
    
    Returns:
    - chi2_stat (float): Chi-squared statistic.
    - p_value (float): p-value of the test.
    """
    control_counts = control_df[categorical_feature].value_counts()
    test_counts = test_df[categorical_feature].value_counts()

    # Create a contingency table
    contingency_table = pd.DataFrame([control_counts, test_counts])
    contingency_table.fillna(0, inplace=True)

    # Perform Chi-squared test
    chi2_stat, p_value, _, _ = chi2_contingency(contingency_table)
    return chi2_stat, p_value

# Function to perform T-test/Z-test for numerical variables
def t_test(control_df, test_df, numeric_feature):
    """
    Performs a Z-test (large samples) or Welch's T-test (small samples) for numerical data.
    
    Args:
    - control_df (pd.DataFrame): The DataFrame for the control group.
    - test_df (pd.DataFrame): The DataFrame for the test group.
    - numeric_feature (str): The numerical feature column to test.
    
    Returns:
    - stat (float): The test statistic (z for large samples, t otherwise).
    - p_value (float): The two-tailed p-value from the test.
    """
    # For large samples (more than 30 rows in each group), use the Z-test (normal) approximation
    if len(control_df) > 30 and len(test_df) > 30:
        mean_diff = control_df[numeric_feature].mean() - test_df[numeric_feature].mean()
        # Squared standard error of the difference in means (unpooled variances)
        se_squared = (control_df[numeric_feature].std()**2 / len(control_df)) + (test_df[numeric_feature].std()**2 / len(test_df))
        z_stat = mean_diff / se_squared**0.5
        p_value = stats.norm.sf(abs(z_stat)) * 2  # Two-tailed
        return z_stat, p_value
    else:
        # For smaller samples, use Welch's T-test (does not assume equal variances)
        t_stat, p_value = stats.ttest_ind(control_df[numeric_feature].dropna(), test_df[numeric_feature].dropna(), equal_var=False)
        return t_stat, p_value

# Hypothesis 1: Test risk differences across provinces (Numerical, so we use T-test/Z-test)
t_stat_provinces, p_value_provinces = t_test(control_df_provinces, test_df_provinces, 'TotalClaims')
print(f"T-Test/Z-Test result for Risk Score across Provinces: t-statistic = {t_stat_provinces}, p-value = {p_value_provinces}")
if p_value_provinces < 0.05:
    print("Reject the null hypothesis: There are risk differences across provinces.")
else:
    print("Fail to reject the null hypothesis: No significant risk differences across provinces were detected.")

# Hypothesis 2: Test risk differences across zip codes (Numerical, so we use T-test/Z-test)
t_stat_zip, p_value_zip = t_test(control_df_zip, test_df_zip, 'TotalClaims')
print(f"T-Test/Z-Test result for Risk Score across Zip Codes: t-statistic = {t_stat_zip}, p-value = {p_value_zip}")
if p_value_zip < 0.05:
    print("Reject the null hypothesis: There are risk differences between zip codes.")
else:
    print("Fail to reject the null hypothesis: No significant risk differences between zip codes were detected.")

# Hypothesis 3: Test margin (profit) difference across zip codes (Numerical, so we use T-test/Z-test)
t_stat_profit_zip, p_value_profit_zip = t_test(control_df_zip, test_df_zip, 'CalculatedPremiumPerTerm')
print(f"T-Test/Z-Test result for Profit Margin across Zip Codes: t-statistic = {t_stat_profit_zip}, p-value = {p_value_profit_zip}")
if p_value_profit_zip < 0.05:
    print("Reject the null hypothesis: There are significant profit margin differences between zip codes.")
else:
    print("Fail to reject the null hypothesis: No significant profit margin differences between zip codes were detected.")

# Hypothesis 4: Test risk difference between Men and Women (Chi-squared test)
# Note: this treats each distinct TotalClaims value as its own category
chi2_stat_gender, p_value_gender = chi_square_test(control_df_gender, test_df_gender, 'TotalClaims')
print(f"Chi-squared test result for Risk Score between Genders: chi2 statistic = {chi2_stat_gender}, p-value = {p_value_gender}")
if p_value_gender < 0.05:
    print("Reject the null hypothesis: There are significant risk differences between Men and Women.")
else:
    print("Fail to reject the null hypothesis: No significant risk differences between Men and Women were detected.")


T-Test/Z-Test result for Risk Score across Provinces: t-statistic = 4.471245314646858, p-value = 7.776543233672052e-06
Reject the null hypothesis: There are risk differences across provinces.
T-Test/Z-Test result for Risk Score across Zip Codes: t-statistic = -1.4013358801895268, p-value = 0.16111365652295284
Fail to reject the null hypothesis: No significant risk differences between zip codes were detected.
T-Test/Z-Test result for Profit Margin across Zip Codes: t-statistic = -7.175048999136269, p-value = 7.228129338355045e-13
Reject the null hypothesis: There are significant profit margin differences between zip codes.
Chi-squared test result for Risk Score between Genders: chi2 statistic = 333.8154023780881, p-value = 0.039002833626968145
Reject the null hypothesis: There are significant risk differences between Men and Women.
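
The gender test above passes `TotalClaims` to the Chi-squared function, so every distinct claim amount becomes its own column in the contingency table. A more conventional setup for a Chi-squared test is a small table of counts, for example whether a policy had any claim, split by gender group. The sketch below computes the Pearson Chi-squared statistic for a 2x2 table using only the standard library. The function name `chi2_2x2` and the counts are invented for illustration, and no continuity correction is applied.

```python
from math import erfc, sqrt


def chi2_2x2(table):
    """Pearson Chi-squared test for a 2x2 table of counts (no continuity correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Survival function of the Chi-squared distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Invented counts. Rows: two gender groups. Columns: [no claim, at least one claim].
toy_table = [[90, 10], [80, 20]]
chi2, p_value = chi2_2x2(toy_table)
print(f"chi2 = {chi2:.4f}, p-value = {p_value:.4f}")
```

`scipy.stats.chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so it would report a somewhat smaller statistic for the same counts unless `correction=False` is passed.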


### Phase 5: Analyze and Report

**Objective:** Document and interpret findings.

The statistical tests show a significant risk difference across provinces, comparing the inland provinces (control group) with the coastal provinces (test group): the test statistic is 4.47 and the p-value is 7.78e-06. This p-value is far below the 0.05 significance threshold, so we reject the null hypothesis. The positive statistic means the control group has the higher average `TotalClaims`. The comparison of risk scores between the zip-code groups does not show a significant difference, with a test statistic of -1.40 and a p-value of 0.16. Because the p-value is greater than 0.05, we fail to reject the null hypothesis: the data do not provide evidence of a risk difference between the zip-code groups compared. This is not proof that no difference exists.

For profit margins across zip codes, measured here by `CalculatedPremiumPerTerm`, the difference is highly significant, with a test statistic of -7.18 and a p-value of 7.23e-13, so we reject the null hypothesis. The average value differs between the two zip-code groups. The Chi-squared test for risk differences between genders yields a statistic of 333.82 and a p-value of 0.039, which is below the 0.05 threshold, so we reject the null hypothesis. This result is close to the threshold and should be read with care, because the test treats each distinct `TotalClaims` value as a category and the control group contains every row that is not `Female`.
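
For the reporting step, the four results can be collected into one table. The sketch below reuses the statistics and p-values printed above. The Bonferroni column is an optional extra, not part of the original analysis: it divides the 0.05 threshold by the number of tests to account for running several tests on the same data.

```python
ALPHA = 0.05

# (test description, test statistic, p-value), copied from the output above.
results = [
    ("Risk: inland vs coastal provinces", 4.471245314646858, 7.776543233672052e-06),
    ("Risk: zip-code groups", -1.4013358801895268, 0.16111365652295284),
    ("Premium: zip-code groups", -7.175048999136269, 7.228129338355045e-13),
    ("Risk: gender groups (chi-squared)", 333.8154023780881, 0.039002833626968145),
]

# Bonferroni-adjusted threshold: ALPHA divided by the number of tests.
bonferroni_alpha = ALPHA / len(results)

rows = [
    (name, stat, p, p < ALPHA, p < bonferroni_alpha)
    for name, stat, p in results
]

print(f"{'Test':<36}{'Statistic':>10}{'p-value':>12}  {'p<0.05':<8}{'Bonferroni':<10}")
for name, stat, p, reject, reject_adj in rows:
    print(f"{name:<36}{stat:>10.2f}{p:>12.2e}  {str(reject):<8}{str(reject_adj):<10}")
```

Under this stricter threshold of 0.0125, the gender result (p = 0.039) would no longer count as significant, while the province and premium results would.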
