### Task 2 - BAN427
##### A look into an A/B test on an online claim form from an insurance company
 


In [None]:
# Setup

import pandas as pd
import numpy as np

# Statistical tests
from scipy.stats import (
    chi2_contingency,
    f_oneway,
    kruskal,
    levene, 
    ks_2samp
)

pd.set_option("display.max_rows", 20)
pd.set_option("display.float_format", lambda x: f"{x:,.3f}")


In [3]:

# Load and display dataset A/B test

AB_test = pd.read_csv("C:/Users/hpilv/Documents/Skole/NHH/4AR/Seminarer/BAN427_InsuranceAnalytics/exam_ab_ex.csv")

AB_test.head()

Unnamed: 0,man,birthyear,claim,number_items_reported,Impression
0,0,1977,110.0,1,2
1,0,1987,12.0,1,2
2,1,1982,628.0,1,0
3,1,1968,503.6,1,0
4,1,1980,807.8,3,2


#### Data Dictionary & Assumptions

- man = 1 if male, 0 if female
- birthyear = Year of birth
- claim = Claim amount (outcome)
- number_items_reported = Items reported
- Impression = randomized AB arm (treatment)

- Assumptions & limits: one row = one claim; no customer ID → can’t test clustering; no timestamps → can’t test seasonality/rollout effects.

### Task 1A - Validity
Randomized experiments in business are crucial to disentangle causality from correlation.

Valid A/B testing requires:
- Random allocation (so groups are statistically comparable).
- Sufficient sample size (power to detect differences).
- No contamination / spillover effects between groups.
- Stable unit treatment value assumption (SUTVA) – outcome of one individual should not depend on treatment of others.

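The lectures emphasize that observational data show associations, not causality, so only randomization (or a natural experiment) can establish causal impact.

We therefore check balance in gender, birth year, and number of items reported across groups to confirm that randomization worked. Note that number_items_reported is entered on the claim form itself, so it may be affected by the impression and is not a clean pre-treatment covariate.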
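The sample-size requirement listed above can be made concrete with a rough power calculation. The following is a minimal sketch, assuming a balanced one-way ANOVA across four arms with 1,500 observations in total and Cohen's f as the effect-size measure. The helper name `anova_power` and the effect sizes tried are illustrative assumptions, not part of the course material.

```python
from scipy.stats import f as f_dist, ncf

def anova_power(effect_f, n_total, k_groups, alpha=0.05):
    """Approximate power of a one-way ANOVA F-test (balanced design assumed)."""
    dfn = k_groups - 1
    dfd = n_total - k_groups
    f_crit = f_dist.ppf(1 - alpha, dfn, dfd)  # rejection threshold under H0
    nc = effect_f**2 * n_total                # noncentrality parameter
    return ncf.sf(f_crit, dfn, dfd, nc)       # P(F > f_crit) under the alternative

# Illustrative effect sizes (Cohen's conventions: 0.10 small, 0.25 medium)
for f_eff in (0.05, 0.10, 0.25):
    print(f"Cohen's f={f_eff:.2f}: power={anova_power(f_eff, 1500, 4):.3f}")
```

If the power for a plausible effect size is low, a non-significant result says little about whether the impressions matter.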

In [4]:
# Check the dataset for outliers and sanity-check
# Print min and max for all columns in the dataset
min_values = AB_test.min(numeric_only=True)
max_values = AB_test.max(numeric_only=True)

print("Minimum values:\n", min_values)
print("\nMaximum values:\n", max_values)

# Unique values for 'man' and 'Impression'
print("Unique values in 'man':", AB_test["man"].unique())
print("Unique values in 'Impression':", AB_test["Impression"].unique())

# Allocation
alloc_counts = AB_test["Impression"].value_counts().sort_index()
alloc_props  = (alloc_counts / len(AB_test)).round(3)
expected_prop = 1 / alloc_counts.size
print("Counts:\n", alloc_counts, "\n")
print("Proportions:\n", alloc_props, "\n")
print(f"Max |dev| from equal split (~{expected_prop:.3f}): {((alloc_props-expected_prop).abs().max()):.3f}")

# Missingness & duplicates
print("\nMissing values per column:\n", AB_test.isna().sum(), "\n")
print("Duplicates (full-row):", AB_test.duplicated().sum())

# Check that missingness is unrelated to treatment arm
# (the test only runs for columns that actually have missing values)
for col in ["man", "birthyear", "number_items_reported"]:
    miss_flag = AB_test[col].isna()
    ct = pd.crosstab(AB_test["Impression"], miss_flag)
    if ct.shape[1] == 2:
        chi2, p, _, _ = chi2_contingency(ct)
        print(f"Missingness by Impression for {col}: p={p:.4f}")

Minimum values:
 man                         0.000
birthyear               1,934.000
claim                       0.000
number_items_reported       1.000
Impression                  0.000
dtype: float64

Maximum values:
 man                         1.000
birthyear               2,003.000
claim                   6,900.000
number_items_reported      19.000
Impression                  3.000
dtype: float64
Unique values in 'man': [0 1]
Unique values in 'Impression': [2 0 3 1]
Counts:
 Impression
0    374
1    356
2    375
3    395
Name: count, dtype: int64 

Proportions:
 Impression
0   0.249
1   0.237
2   0.250
3   0.263
Name: count, dtype: float64 

Max |dev| from equal split (~0.250): 0.013

Missing values per column:
 man                      0
birthyear                0
claim                    0
number_items_reported    0
Impression               0
dtype: int64 

Duplicates (full-row): 2
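
The allocation counts can also be tested formally for a sample ratio mismatch. This is a minimal sketch using a chi-square goodness-of-fit test. It assumes the intended design was an equal split across the four arms, which the data description does not state explicitly.

```python
from scipy.stats import chisquare

# Arm counts copied from the allocation output above
observed = [374, 356, 375, 395]

# Without f_exp, chisquare tests against equal expected counts (assumed design)
srm_stat, srm_p = chisquare(observed)
print(f"SRM chi-square={srm_stat:.3f}, p={srm_p:.4f}")
```

A large p-value here is consistent with an equal-split randomization; a very small one would suggest a problem in assignment or logging.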


In [5]:
ct_man = pd.crosstab(AB_test["Impression"], AB_test["man"])
chi2, p, dof, _ = chi2_contingency(ct_man)
n = ct_man.to_numpy().sum()
r, c = ct_man.shape
cramers_v = np.sqrt(chi2 / (n * (min(r, c) - 1)))

print(ct_man, "\n")
print(f"Chi-square={chi2:.3f}, df={dof}, p={p:.4f}")
print(f"Cramér's V={cramers_v:.3f} (≈0 none, ~0.1 small, ~0.3 med, ~0.5 large)")


man           0    1
Impression          
0           204  170
1           200  156
2           224  151
3           237  158 

Chi-square=3.352, df=3, p=0.3405
Cramér's V=0.047 (≈0 none, ~0.1 small, ~0.3 med, ~0.5 large)


In [6]:
baseline_num = ["birthyear", "number_items_reported"]

def eta_squared_oneway(col, groups_dict):
    overall = AB_test[col].mean()
    ss_between = sum(len(g) * (g.mean() - overall)**2 for g in groups_dict.values())
    ss_total = ((AB_test[col] - overall)**2).sum()
    return ss_between / ss_total if ss_total > 0 else np.nan

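def epsilon_squared_kruskal(H, k, n_):
    # Effect size for Kruskal-Wallis. Note: (H - k + 1) / (n - k) is usually called
    # the H-based eta-squared and can be slightly negative; epsilon-squared is H / (n - 1).
    return (H - k + 1) / (n_ - k) if (n_ - k) > 0 else np.nan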

for col in baseline_num:
    groups = {imp: vals[col].dropna() for imp, vals in AB_test.groupby("Impression")}
    glist = list(groups.values())
    k = len(glist)
    n_total = sum(len(g) for g in glist)

    desc = AB_test.groupby("Impression")[col].agg(
        n="count", mean="mean", std="std", median="median",
        q25=lambda s: s.quantile(0.25), q75=lambda s: s.quantile(0.75)
    )
    desc["IQR"] = desc["q75"] - desc["q25"]
    print(f"\n--- {col} descriptives ---\n", desc[["n","mean","std","median","IQR"]].round(3))

    # Levene, ANOVA, Kruskal–Wallis
    lev_s, lev_p = levene(*glist)
    a_s, a_p = (f_oneway(*glist) if all(len(g)>1 for g in glist) else (np.nan, np.nan))
    eta2 = eta_squared_oneway(col, groups) if np.isfinite(a_s) else np.nan
    k_s, k_p = kruskal(*glist)
    eps2 = epsilon_squared_kruskal(k_s, k, n_total)

    print(f"Levene p={lev_p:.4f} | ANOVA p={a_p:.4f}, eta²={eta2:.3f} | KW p={k_p:.4f}, eps²={eps2:.3f}")

    # Pairwise KS (distributional)
    imps = sorted(groups.keys())
    for i in range(len(imps)):
        for j in range(i+1, len(imps)):
            D, pks = ks_2samp(groups[imps[i]], groups[imps[j]])
            print(f"KS {imps[i]} vs {imps[j]}: D={D:.3f}, p={pks:.4f}")



--- birthyear descriptives ---
               n      mean    std    median    IQR
Impression                                       
0           374 1,974.067 12.117 1,973.000 17.000
1           356 1,974.803 12.658 1,975.000 17.000
2           375 1,975.344 12.815 1,975.000 20.000
3           395 1,974.651 13.259 1,974.000 20.000
Levene p=0.0868 | ANOVA p=0.5908, eta²=0.001 | KW p=0.4977, eps²=-0.000
KS 0 vs 1: D=0.070, p=0.3089
KS 0 vs 2: D=0.092, p=0.0740
KS 0 vs 3: D=0.061, p=0.4403
KS 1 vs 2: D=0.058, p=0.5393
KS 1 vs 3: D=0.076, p=0.2106
KS 2 vs 3: D=0.047, p=0.7598

--- number_items_reported descriptives ---
               n  mean   std  median   IQR
Impression                               
0           374 2.078 2.006   1.000 1.000
1           356 1.736 1.407   1.000 1.000
2           375 1.971 1.792   1.000 1.000
3           395 1.967 1.999   1.000 1.000
Levene p=0.0800 | ANOVA p=0.0800, eta²=0.005 | KW p=0.0670, eps²=0.003
KS 0 vs 1: D=0.091, p=0.0867
KS 0 vs 2: D=0.041, p=0.

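#### Task 1A: Interpretation

- Allocation is roughly equal across the four arms (23.7% to 26.3%; maximum deviation from 25% is 0.013).
- No column has missing values, so missingness cannot differ by arm. Two full-row duplicates exist; without customer IDs we cannot tell whether they are repeated records or distinct claims that happen to coincide.
- Balance: for man, χ² p = 0.34 and Cramér’s V = 0.047 (negligible). For birthyear and number_items_reported, Kruskal-Wallis p > 0.05 (0.50 and 0.067) and the effect sizes are near 0. None of the pairwise KS tests shown is significant at the 5% level.
- Conclusion: Randomization looks successful on observed covariates, which supports an intention-to-treat (ITT) analysis.
- Limits: there are no customer IDs, so independence of observations cannot be verified, and no timestamps, so seasonality and rollout effects cannot be checked.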
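
The pairwise KS comparisons run six tests per covariate, so some small p-values are expected by chance alone. The sketch below adjusts the birthyear KS p-values for multiple comparisons. Holm's step-down correction is used here as one common choice; the notebook itself applies no correction, and the helper `holm_adjust` is an illustrative name.

```python
import numpy as np

def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Multiply the i-th smallest p-value by (m - i), then enforce monotonicity
    scaled = p[order] * (m - np.arange(m))
    adj_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adj_sorted
    return adjusted

# Pairwise KS p-values for birthyear, copied from the output above
ks_p_birthyear = [0.3089, 0.0740, 0.4403, 0.5393, 0.2106, 0.7598]
print(holm_adjust(ks_p_birthyear).round(3))
```

After adjustment, the smallest p-value (arm 0 vs arm 2) moves further from the 5% threshold, which supports reading the unadjusted 0.074 as noise rather than imbalance.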

### Task 1B - Do impressions affect claim amount? Do gender/age matter?

We use a hypothesis-testing framework:

- Null hypothesis: mean claim amount is the same across all impressions.
- Alternative: at least one impression has a different mean claim amount.

Theory suggests conditional correlation tests and regressions as the standard tools, e.g. Claims = f(tariff factors + treatment variable). We can use ANOVA, or a regression with impression dummies, to test significance.

We then extend the model with gender and age as covariates, since the lecture material highlights their importance in explaining claim frequency and severity.

If claims differ by gender or age, this ties back to selection and moral hazard theory:

- Gender and age may proxy for risk types (selection).
- Warnings (treatments) may influence behavior (moral hazard).
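
The null hypothesis concerns all impression dummies at once, so once covariates are included the relevant test is a joint one. This is a minimal sketch of a nested-model F-test using statsmodels. The data are simulated stand-ins, not the exam data, and the test uses classical (homoskedastic) standard errors, unlike the HC3 fits further down. The helper name `impression_joint_test` is illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def impression_joint_test(df):
    """F-test that all C(Impression) coefficients are zero, given man and birthyear."""
    restricted = smf.ols("claim ~ man + birthyear", data=df).fit()
    full = smf.ols("claim ~ C(Impression) + man + birthyear", data=df).fit()
    f_stat, p_value, df_diff = full.compare_f_test(restricted)
    return f_stat, p_value, int(df_diff)

# Simulated stand-in data (NOT the exam data): no true arm effect is built in
rng = np.random.default_rng(0)
n_demo = 400
demo = pd.DataFrame({
    "Impression": rng.integers(0, 4, n_demo),
    "man": rng.integers(0, 2, n_demo),
    "birthyear": rng.integers(1940, 2000, n_demo),
})
demo["claim"] = rng.gamma(shape=1.0, scale=800.0, size=n_demo)

f_stat, p_value, df_diff = impression_joint_test(demo)
print(f"F={f_stat:.3f}, p={p_value:.4f}, restrictions={df_diff}")
```

On the real data the same function could be called as `impression_joint_test(AB_test)`. With four arms it tests three restrictions.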

In [16]:
# Distribution & tests by Impression
desc_claim = AB_test.groupby("Impression")["claim"].agg(["count","mean","std","median","min","max"])
print(desc_claim.round(2), "\n")

# Variances
g_claim = [g["claim"].dropna() for _, g in AB_test.groupby("Impression")]
lev_s, lev_p = levene(*g_claim)
print(f"Levene test for claim variances: p={lev_p:.4f}")

# ANOVA & Kruskal–Wallis for claim ~ Impression
a_s, a_p = (f_oneway(*g_claim) if all(len(g)>1 for g in g_claim) else (np.nan, np.nan))
k_s, k_p = kruskal(*g_claim)
print(f"ANOVA p={a_p:.4f} | Kruskal–Wallis p={k_p:.4f}")


            count    mean     std  median   min       max
Impression                                               
0             374 911.250 930.020 580.350 0.000 5,500.000
1             356 866.240 828.710 630.100 0.000 5,990.000
2             375 791.260 781.930 541.200 0.000 5,643.600
3             395 818.040 832.520 559.100 0.000 6,900.000 

Levene test for claim variances: p=0.1461
ANOVA p=0.2181 | Kruskal–Wallis p=0.3843


In [14]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

# OLS with robust SE (HC3). ITT: impressions as assigned.
model = smf.ols("claim ~ C(Impression) + man + birthyear", data=AB_test).fit(cov_type="HC3")
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:                  claim   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.010
Method:                 Least Squares   F-statistic:                     3.847
Date:                Tue, 26 Aug 2025   Prob (F-statistic):            0.00182
Time:                        11:35:51   Log-Likelihood:                -12228.
No. Observations:                1500   AIC:                         2.447e+04
Df Residuals:                    1494   BIC:                         2.450e+04
Df Model:                           5                                         
Covariance Type:                  HC3                                         
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept           6407.4592   3597

In [15]:
model_log = smf.ols("np.log1p(claim) ~ C(Impression) + man + birthyear", data=AB_test).fit(cov_type="HC3")
print(model_log.summary())


                            OLS Regression Results                            
Dep. Variable:        np.log1p(claim)   R-squared:                       0.018
Model:                            OLS   Adj. R-squared:                  0.014
Method:                 Least Squares   F-statistic:                     4.806
Date:                Tue, 26 Aug 2025   Prob (F-statistic):           0.000231
Time:                        11:36:54   Log-Likelihood:                -2135.6
No. Observations:                1500   AIC:                             4283.
Df Residuals:                    1494   BIC:                             4315.
Df Model:                           5                                         
Covariance Type:                  HC3                                         
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept             19.1960      4

### Task 1B - Interpretation
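- Neither test rejects the null of equal claim amounts across impressions (ANOVA p = 0.218, Kruskal-Wallis p = 0.384). Levene's test (p = 0.146) gives no evidence of unequal variances. Mean claims range from 791 (arm 2) to 911 (arm 0), but these differences are not statistically distinguishable in this sample.
- From OLS: the coefficients on C(Impression)[T.X] are arm differences versus the baseline arm 0, holding man and birthyear fixed. Both models explain little of the variation (R² = 0.013 in levels, 0.018 in logs). The overall F-tests are significant (p = 0.002 and p = 0.0002), but they cover all regressors including man and birthyear, so they do not by themselves show an impression effect; the individual coefficient rows are needed for that.
- Robustness: claims are right-skewed (means well above medians in every arm), so the log1p model is a useful check on the levels model.
- Causal language is justified for impression because it is randomized; man and birthyear are correlational predictors of the claim level.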
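
Coefficients from the log1p model are easier to communicate as percentage changes. The sketch below shows the conversion. Because the outcome is log(1 + claim), the percentage applies to 1 + claim, which is close to the claim itself only for claims well above zero. The coefficient values used are hypothetical, not results from this notebook.

```python
import numpy as np

def log1p_coef_to_pct(beta):
    """Approximate % change in (1 + claim) implied by a log1p-model coefficient."""
    return (np.exp(beta) - 1.0) * 100.0

# Hypothetical coefficient values, for illustration only (not notebook results)
for beta in (-0.10, 0.05):
    print(f"beta={beta:+.2f} -> {log1p_coef_to_pct(beta):+.1f}% change in 1 + claim")
```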

### Task 1C - Management advice

Notes:
- Tie back to the Customer Lifetime Value (CLTV) framework: any intervention must be evaluated not only on claim reduction but also on long-term profitability and retention.
- From experiments theory: do not rely on a single sample; replicate the A/B test or validate it before rollout.
- Consider ethical and statutory constraints on interventions (the lecture stresses this when discussing treatments).
- Possible advice: instead of immediately rolling out the “best” impression, recommend further testing and monitoring over time.

Recommendation:
- In this test the differences between arms are not statistically significant (ANOVA p = 0.218, Kruskal-Wallis p = 0.384), so the data do not yet justify rolling out any single impression.
- If one impression lowers claims materially and the result is statistically robust, consider it, but:
  - Check customer experience and fairness/regulatory implications.
  - Prefer a staged rollout or an extended test that collects broader KPIs (see 1D).
  - Frame the decision as expected savings versus risk to retention and brand (lecture emphasis on CLTV and the data-generating process, DGP).
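
To frame expected savings with their uncertainty, the difference in mean claim between two arms can be reported with a confidence interval. This is a minimal sketch using a percentile bootstrap, chosen here because claim amounts are skewed. The data are simulated, not the exam data, and which arm counts as the control is an assumption the source does not settle.

```python
import numpy as np

def bootstrap_mean_diff(treated, control, n_boot=2000, seed=0):
    """Percentile bootstrap 95% CI for mean(treated) - mean(control)."""
    rng = np.random.default_rng(seed)
    treated = np.asarray(treated, dtype=float)
    control = np.asarray(control, dtype=float)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        # Resample each arm with replacement and record the mean difference
        t = rng.choice(treated, size=treated.size, replace=True)
        c = rng.choice(control, size=control.size, replace=True)
        diffs[b] = t.mean() - c.mean()
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return treated.mean() - control.mean(), lo, hi

# Simulated claims (NOT the exam data); the scales are arbitrary
rng = np.random.default_rng(1)
control_demo = rng.gamma(shape=1.0, scale=900.0, size=370)
treated_demo = rng.gamma(shape=1.0, scale=800.0, size=370)
diff, lo, hi = bootstrap_mean_diff(treated_demo, control_demo)
print(f"Mean difference per claim: {diff:.1f} (95% CI {lo:.1f} to {hi:.1f})")
```

On the real data, the two inputs would be the claim columns of the chosen arms, for example `AB_test.loc[AB_test["Impression"] == 2, "claim"]` against arm 0. An interval that spans zero means the savings estimate is not reliable enough to act on alone.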


### Task 1D - Additional outcome variables for improved insight?

- Claim amount is only one outcome. Other relevant measures:
  - Number of items reported,
  - Claim denial/fraud suspicion rates,
  - Claim processing time,
  - Customer satisfaction,
  - Long-term outcomes (customer retention, renewals), which feed into CLTV.
- These tie into the emphasis on CLTV and holistic evaluation.

**Conclusion:** For full managerial insight, extend the A/B test evaluation to multiple outcomes, not just claim amount.
