| Null Hypothesis                                               | Test Goal                                                     |
| ------------------------------------------------------------- | ------------------------------------------------------------- |
| **H₀₁**: No risk difference across provinces                  | Check if Claim Frequency or Severity differs across provinces |
| **H₀₂**: No risk difference across zip codes                  | Check if Claim Frequency or Severity differs across PostalCode |
| **H₀₃**: No significant margin difference between zip codes   | Check if mean Margin (TotalPremium - TotalClaims) differs across PostalCode |
| **H₀₄**: No significant risk difference between men and women | Check if Claim Frequency differs by Gender                    |


In [2]:
import pandas as pd

df = pd.read_csv("converted.csv")

# Ensure TotalClaims and TotalPremium are numeric
df['TotalClaims'] = pd.to_numeric(df['TotalClaims'], errors='coerce')
df['TotalPremium'] = pd.to_numeric(df['TotalPremium'], errors='coerce')
df.dropna(subset=['TotalClaims', 'TotalPremium'], inplace=True)

# Create margin column
df['Margin'] = df['TotalPremium'] - df['TotalClaims']

# Create binary column for ClaimOccurred
df['ClaimOccurred'] = (df['TotalClaims'] > 0).astype(int)



In [3]:
import scipy.stats as stats

# Claim frequency
province_freq = df.groupby('Province')['ClaimOccurred'].mean()

# Chi-squared test for independence
contingency = pd.crosstab(df['Province'], df['ClaimOccurred'])
chi2, p, dof, _ = stats.chi2_contingency(contingency)
print("Chi-Squared test for Claim Frequency by Province")
print(f"p-value: {p:.4f}")


Chi-Squared test for Claim Frequency by Province
p-value: 0.0000
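
A p-value that prints as 0.0000 shows that an association exists, but not how strong it is. As a minimal sketch, an effect size such as Cramér's V could be reported alongside the test. The helper name `cramers_v` and the toy table are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from scipy import stats


def cramers_v(table: pd.DataFrame) -> float:
    """Cramér's V for an r x c contingency table (0 = no association, 1 = perfect)."""
    # Use the uncorrected statistic; Yates' correction only applies to 2x2 tables
    # and shrinks the chi-squared value.
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1  # smaller table dimension minus one
    return float(np.sqrt(chi2 / (n * k)))


# Toy data (hypothetical): two provinces with different claim rates
toy = pd.DataFrame({
    "Province": ["A"] * 100 + ["B"] * 100,
    "ClaimOccurred": [1] * 10 + [0] * 90 + [1] * 30 + [0] * 70,
})
toy_table = pd.crosstab(toy["Province"], toy["ClaimOccurred"])
print(f"Cramér's V: {cramers_v(toy_table):.3f}")
```

On the notebook's data this would be called as `cramers_v(contingency)`, reusing the table built in the cell above.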


In [4]:
# ANOVA for claim severity where claim > 0
severity_df = df[df['TotalClaims'] > 0]
anova_result = stats.f_oneway(
    *[group["TotalClaims"].values for name, group in severity_df.groupby("Province")]
)
print("ANOVA for Claim Severity by Province")
print(f"p-value: {anova_result.pvalue:.4f}")


ANOVA for Claim Severity by Province
p-value: 0.0000
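
One-way ANOVA assumes roughly normal groups with similar variances, and claim amounts are often right-skewed. A rank-based Kruskal-Wallis test is a common cross-check. The sketch below uses made-up severity values and a hypothetical helper `severity_kruskal`; it is not the test the notebook ran.

```python
import pandas as pd
from scipy import stats


def severity_kruskal(data: pd.DataFrame, group_col: str, value_col: str = "TotalClaims"):
    """Kruskal-Wallis H-test of value_col across the levels of group_col."""
    groups = [g[value_col].to_numpy() for _, g in data.groupby(group_col)]
    return stats.kruskal(*groups)


# Toy data (hypothetical): three provinces with clearly separated claim sizes
toy = pd.DataFrame({
    "Province": ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "TotalClaims": [1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 20, 21, 22, 23, 24],
})
result = severity_kruskal(toy, "Province")
print(f"H = {result.statistic:.2f}, p-value = {result.pvalue:.4f}")
```

With the notebook's data the call would be `severity_kruskal(severity_df, "Province")`. If both tests agree, the ANOVA conclusion is less likely to be an artifact of skew.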


In [5]:
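# Descriptive summaries per postal code (not used by the test below)
zipcode_freq = df.groupby('PostalCode')['ClaimOccurred'].mean()
zipcode_severity = df[df['TotalClaims'] > 0].groupby('PostalCode')['TotalClaims'].mean()

# ANOVA on severity; severity_df (rows with TotalClaims > 0) comes from In [4]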
anova_zipcode = stats.f_oneway(
    *[group["TotalClaims"].values for name, group in severity_df.groupby("PostalCode")]
)
print("ANOVA for Claim Severity by Zipcode")
print(f"p-value: {anova_zipcode.pvalue:.4f}")


ANOVA for Claim Severity by Zipcode
p-value: 0.0335


In [6]:
margin_by_zipcode = df.groupby('PostalCode')['Margin'].mean()

# ANOVA
anova_margin = stats.f_oneway(
    *[group["Margin"].values for name, group in df.groupby("PostalCode")]
)
print("ANOVA for Margin by Zipcode")
print(f"p-value: {anova_margin.pvalue:.4f}")


ANOVA for Margin by Zipcode
p-value: 0.9977


In [7]:
# Claim indicator (0/1) for each gender group
male = df[df["Gender"] == "Male"]["ClaimOccurred"]
female = df[df["Gender"] == "Female"]["ClaimOccurred"]

# Two-sample t-test on the 0/1 indicator, i.e. a comparison of mean claim frequency
t_stat, p_value = stats.ttest_ind(male, female)
print("T-test for Claim Frequency by Gender")
print(f"p-value: {p_value:.4f}")


T-test for Claim Frequency by Gender
p-value: 0.8405
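
Because `ClaimOccurred` is binary, the gender comparison can also be framed as a test on a contingency table, mirroring the province test above. The following is a minimal sketch on made-up data; the helper `claim_frequency_test` is a hypothetical name, and the sketch is offered as a cross-check rather than a replacement for the t-test.

```python
import pandas as pd
from scipy import stats


def claim_frequency_test(data: pd.DataFrame, group_col: str,
                         outcome_col: str = "ClaimOccurred"):
    """Chi-squared test of independence between group_col and a 0/1 outcome."""
    table = pd.crosstab(data[group_col], data[outcome_col])
    chi2, p, dof, _ = stats.chi2_contingency(table)
    return table, p


# Toy data (hypothetical): both groups have the same 10% claim rate
toy = pd.DataFrame({
    "Gender": ["Male"] * 50 + ["Female"] * 50,
    "ClaimOccurred": [1] * 5 + [0] * 45 + [1] * 5 + [0] * 45,
})
table, p = claim_frequency_test(toy, "Gender")
print(table)
print(f"p-value: {p:.4f}")
```

On the notebook's data, `claim_frequency_test(df[df["Gender"].isin(["Male", "Female"])], "Gender")` would restrict the table to the two groups compared above.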


| Hypothesis Code | Description                                                         | p-value    | Decision             | Business Insight                                                                 |
| --------------- | ------------------------------------------------------------------- | ---------- | -------------------- | -------------------------------------------------------------------------------- |
| **H₀₁**         | No significant difference in **claim risk** across **Provinces**    | **< 0.0001** | ❌ **Reject H₀₁**     | Both claim frequency (chi-squared) and claim severity (ANOVA) differ significantly across provinces. |
| **H₀₂**         | No significant difference in **claim severity** across **Zipcodes** | **0.0335** | ❌ **Reject H₀₂**     | Claim severity differs across postal codes at the 5% level; claim frequency by postal code was not tested. Postal-code-level pricing may be worth investigating. |
| **H₀₃**         | No significant difference in **margins** across **Zipcodes**        | **0.9977** | ✅ **Fail to Reject** | No evidence that mean margin differs across zip codes.                           |
| **H₀₄**         | No significant difference in **claim frequency** by **Gender**      | **0.8405** | ✅ **Fail to Reject** | No evidence that claim frequency differs by gender, so these results do not support gender-based premium segmentation. |
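
The decisions in the table are consistent with a 5% significance level, which the notebook does not state explicitly. As a sketch under that assumption, the Decision column can be derived from the printed p-values with a small helper. `decide` and the dictionary keys are illustrative names.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the test decision for a p-value at significance level alpha."""
    return "Reject H0" if p_value < alpha else "Fail to reject H0"


# p-values as printed by the cells above (0.0000 is a rounded value, not an exact zero)
printed_p_values = {
    "Claim frequency by Province": 0.0000,
    "Claim severity by Province": 0.0000,
    "Claim severity by Zipcode": 0.0335,
    "Margin by Zipcode": 0.9977,
    "Claim frequency by Gender": 0.8405,
}

# alpha = 0.05 is an assumption; pass a stricter value to see which decisions change
decisions = {name: decide(p) for name, p in printed_p_values.items()}
for name, decision in decisions.items():
    print(f"{name}: {decision}")
```

Keeping `alpha` as a parameter makes it easy to check how sensitive a borderline result such as 0.0335 is to the chosen threshold.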

