Exploratory Data Analysis — Claims Severity
Dataset Overview

The dataset contains 188,318 claim records and 132 columns, reflecting the structure of real-world P&C insurance data: an id, 116 categorical features (cat*), 14 continuous risk attributes (cont*), and the target variable, loss, which represents claim severity. The feature space is therefore dominated by categorical variables.

This composition mirrors operational insurance datasets, where segmentation variables often carry more signal than individual numeric measures.

Target Variable: Claims Severity

Claims severity exhibits a highly right-skewed distribution, with a long tail of high-cost claims. This behavior is consistent with insurance loss distributions and motivates variance-stabilizing transformations prior to modeling.

To address skewness, the analysis uses a log(1 + loss) transformation, which produces a more symmetric distribution suitable for regression modeling while preserving relative severity ordering.
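The effect of the transform can be sketched on synthetic data. The lognormal sample below is an assumption used as a stand-in for the real loss column, and `skewness` is a hypothetical helper, not part of the notebook. The sketch checks three things: skewness drops after log(1 + loss), the ordering of claims is unchanged, and `np.expm1` recovers the original values.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic, right-skewed stand-in for claim losses (assumption: lognormal shape)
loss = rng.lognormal(mean=7.5, sigma=0.8, size=10_000)

def skewness(x):
    """Sample skewness: third central moment over the cubed standard deviation."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

log_loss = np.log1p(loss)        # forward transform: log(1 + loss)
recovered = np.expm1(log_loss)   # inverse transform: exp(v) - 1

# A monotone transform keeps claims in the same order
order_preserved = bool(np.all(np.diff(log_loss[np.argsort(loss)]) >= 0))

print("skew (raw):", round(skewness(loss), 2))
print("skew (log1p):", round(skewness(log_loss), 2))
print("order preserved:", order_preserved)
print("round trip ok:", np.allclose(recovered, loss))
```

Any model trained on the log scale produces predictions in log units, so they need `np.expm1` before being compared with losses in currency terms.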

Tail Risk & Loss Concentration

Tail risk is present but not extreme:

The top 1% of claims account for ~6% of total loss

Losses are meaningfully concentrated, but the portfolio is not dominated by catastrophic outliers

This suggests opportunities for severity-aware prioritization without relying solely on rare extreme events.
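For reference, the concentration figure is a simple ratio: the sum of losses at or above the 99th percentile divided by total loss. The sketch below wraps that ratio in a hypothetical helper, `top_share`, and runs it on toy values rather than the real data. It mirrors the computation the notebook performs on the loss column further down.

```python
import numpy as np

def top_share(losses, q=0.99):
    """Share of total loss carried by claims at or above the q-th quantile.

    Ties at the threshold are all counted as part of the top group.
    """
    losses = np.asarray(losses, dtype=float)
    threshold = np.quantile(losses, q)
    return losses[losses >= threshold].sum() / losses.sum()

# Toy example: losses 1, 2, ..., 100 (illustrative values only)
toy = np.arange(1, 101)
share_top_1pct = top_share(toy, q=0.99)
share_top_10pct = top_share(toy, q=0.90)

print("top 1% share:", share_top_1pct)
print("top 10% share:", share_top_10pct)
```

As a yardstick, a top 1% group holding about 6% of loss carries roughly six times the share it would hold if loss were spread evenly across claims.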

Feature Relationships

Individual continuous features show weak linear correlation with log-transformed severity (maximum absolute correlation ≈ 0.10). This suggests that claim severity is not driven by any single dominant predictor but by complex, non-linear interactions among risk factors, which is typical in insurance contexts.

This supports the use of tree-based or ensemble models for severity estimation rather than relying solely on linear assumptions.
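A weak Pearson correlation rules out a strong linear relationship, not dependence in general. The sketch below illustrates the distinction on synthetic data (an assumption, unrelated to the actual cont* features): a U-shaped relationship gives a Pearson correlation near zero, while the share of variance explained by decile-bin means (the correlation ratio, here `eta_sq`) is high. Decile binning is an illustrative choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic feature and a U-shaped response with a little noise
x = pd.Series(rng.uniform(0, 1, 20_000))
y = (x - 0.5) ** 2 + rng.normal(0, 0.01, len(x))

# Linear association: close to zero for a symmetric U shape
pearson = x.corr(y)

# Correlation ratio: between-bin variance of y divided by total variance
bins = pd.qcut(x, 10, labels=False)
bin_means = y.groupby(bins).mean()
bin_counts = y.groupby(bins).size()
between = (bin_counts * (bin_means - y.mean()) ** 2).sum()
total = ((y - y.mean()) ** 2).sum()
eta_sq = between / total

print("Pearson r:", round(pearson, 3))
print("Correlation ratio (eta^2):", round(eta_sq, 3))
```

Running the same binned check on the cont* features would show whether their low correlations hide this kind of structure. That result is not reported here.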

EDA Key Takeaways

Claims severity is heavily right-skewed; log transform stabilizes variance and supports regression.

Tail risk exists but is moderately distributed

This portfolio doesn’t appear dominated by extreme catastrophes, but severity concentration is still material enough that ranking and segmentation add value.

Severity drivers are diffuse rather than singular

Segmentation is likely more valuable than single-variable thresholds

These findings guide the modeling approach toward robust baselines and ranking-based evaluation.
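One possible form of ranking-based evaluation is a loss-capture rate: the share of actual loss that falls in the top fraction of claims when they are ranked by predicted severity. The notebook does not say which metric it uses, so the helper `loss_capture`, the `frac` parameter, and the toy values below are all assumptions for illustration.

```python
import numpy as np

def loss_capture(y_true, y_pred, frac=0.10):
    """Share of actual loss in the top `frac` of claims ranked by prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    k = max(1, int(np.ceil(frac * len(y_true))))   # number of claims in the top group
    top_idx = np.argsort(-y_pred, kind="stable")[:k]  # highest predictions first
    return y_true[top_idx].sum() / y_true.sum()

# Toy actual losses (illustrative values only; they sum to 9000)
actual = np.array([100., 2000., 50., 800., 300., 150., 5000., 60., 400., 140.])

best = loss_capture(actual, actual, frac=0.2)    # perfect ranking
worst = loss_capture(actual, -actual, frac=0.2)  # fully inverted ranking

print("capture with perfect ranking:", round(best, 4))
print("capture with inverted ranking:", round(worst, 4))
```

A metric like this rewards putting the costly claims near the top of the list, even when the predicted amounts themselves are imprecise.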

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 120)


In [None]:
DATA_PATH = (Path().resolve() / ".." / "data" / "train.csv").resolve()
print("Loading:", DATA_PATH)
print("Exists?", DATA_PATH.exists())

df = pd.read_csv(DATA_PATH)
df.head()


In [None]:
print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist()[:20], "...")

df.dtypes.value_counts()


In [None]:
target_col = "loss"
id_col = "id"

cat_cols = [c for c in df.columns if c.startswith("cat")]
cont_cols = [c for c in df.columns if c.startswith("cont")]

print("Target:", target_col)
print("ID:", id_col)
print("# categorical:", len(cat_cols))
print("# continuous:", len(cont_cols))


In [None]:
missing = df.isna().mean().sort_values(ascending=False)
missing = missing[missing > 0]
missing


In [None]:
print("Duplicate rows:", df.duplicated().sum())
print("Duplicate ids:", df[id_col].duplicated().sum())


In [None]:
plt.figure()
plt.hist(df[target_col], bins=50)
plt.title("Loss (Severity) Distribution - Raw")
plt.xlabel("loss")
plt.ylabel("count")
plt.show()


In [None]:
df[target_col].describe(percentiles=[0.5, 0.75, 0.9, 0.95, 0.99])


In [None]:
df["log_loss"] = np.log1p(df[target_col])

plt.figure()
plt.hist(df["log_loss"], bins=50)
plt.title("Log(1 + Loss) Distribution")
plt.xlabel("log_loss")
plt.ylabel("count")
plt.show()


In [None]:
df[cont_cols].describe().T.head(10)


In [None]:
# Correlation of each continuous feature with log_loss, strongest (by absolute value) first
corr = (
    df[cont_cols + ["log_loss"]]
    .corr(numeric_only=True)["log_loss"]
    .drop("log_loss")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
corr.head(15)


In [None]:
corr_mat = df[cont_cols].corr(numeric_only=True)

plt.figure(figsize=(10, 8))
plt.imshow(corr_mat, aspect="auto")
plt.title("Continuous Feature Correlation (cont*)")
plt.colorbar()
plt.show()


In [None]:
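# Level frequencies for the first categorical feature (an arbitrary example, not a ranked "top" feature)
top_cat = cat_cols[0]
df[top_cat].value_counts().head(10)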


In [None]:
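def top_levels_by_mean_loss(df, cat_col, target="loss", top_n=10, min_count=200):
    """Return the top_n levels of cat_col by mean target, ignoring levels with fewer than min_count rows."""
    temp = df.groupby(cat_col)[target].agg(["count", "mean"]).reset_index()
    # Drop sparse levels so a handful of claims cannot dominate the ranking
    temp = temp[temp["count"] >= min_count].sort_values("mean", ascending=False).head(top_n)
    return temp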

example = top_levels_by_mean_loss(df, cat_cols[0], target=target_col, top_n=10, min_count=200)
example


In [None]:
results = []
for c in cat_cols[:10]:  # first 10 cats is enough for EDA
    temp = top_levels_by_mean_loss(df, c, target=target_col, top_n=3, min_count=200)
    temp["feature"] = c
    results.append(temp)

seg = pd.concat(results, ignore_index=True)
seg.sort_values(["mean"], ascending=False).head(20)


In [None]:
# Pick one categorical feature to visualize well
cat_to_plot = cat_cols[0]
tmp = df.groupby(cat_to_plot)[target_col].agg(["count", "mean"]).reset_index()
tmp = tmp[tmp["count"] >= 500].sort_values("mean", ascending=False).head(15)

plt.figure(figsize=(10, 4))
plt.bar(tmp[cat_to_plot].astype(str), tmp["mean"])
plt.title(f"Top 15 {cat_to_plot} Levels by Average Severity (min count=500)")
plt.xlabel(cat_to_plot)
plt.ylabel("Avg loss")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()


In [None]:
q = df[target_col].quantile([0.5, 0.75, 0.9, 0.95, 0.99, 0.995, 0.999])
q


In [None]:
threshold = df[target_col].quantile(0.99)
top1 = df[df[target_col] >= threshold][target_col].sum()
total = df[target_col].sum()
print("Top 1% threshold:", threshold)
print("Share of total loss in top 1%:", top1 / total)


In [None]:
OUT_DIR = (Path().resolve() / ".." / "outputs" / "charts").resolve()
OUT_DIR.mkdir(parents=True, exist_ok=True)
print("Saving charts to:", OUT_DIR)

def save_fig(filename):
    """Save the current figure to OUT_DIR; call before plt.show()."""
    plt.tight_layout()
    plt.savefig(OUT_DIR / filename, dpi=150)


In [None]:
# Raw loss
plt.figure()
plt.hist(df[target_col], bins=50)
plt.title("Loss (Severity) Distribution - Raw")
plt.xlabel("loss")
plt.ylabel("count")
save_fig("01_loss_raw_hist.png")
plt.show()

# Log loss
plt.figure()
plt.hist(df["log_loss"], bins=50)
plt.title("Log(1 + Loss) Distribution")
plt.xlabel("log_loss")
plt.ylabel("count")
save_fig("02_log_loss_hist.png")
plt.show()

# Segment plot (reuses tmp and cat_to_plot from the earlier segment cell)
plt.figure(figsize=(10, 4))
plt.bar(tmp[cat_to_plot].astype(str), tmp["mean"])
plt.title(f"Top 15 {cat_to_plot} Levels by Avg Severity (min count=500)")
plt.xlabel(cat_to_plot)
plt.ylabel("Avg loss")
plt.xticks(rotation=45, ha="right")
save_fig(f"03_{cat_to_plot}_top_severity_segments.png")
plt.show()
