# RustyStats: Getting Started

This notebook demonstrates how to use RustyStats for Generalized Linear Models (GLMs).

RustyStats is a Rust-backed GLM library designed for actuarial applications, providing:
- Fast IRLS fitting
- Support for Gaussian, Poisson, Binomial, and Gamma families
- Offsets (for exposure) and observation weights
- Statistical inference (p-values, confidence intervals)
- Relativity tables for pricing

In [1]:
import numpy as np
import rustystats as rs

# Set random seed for reproducibility
np.random.seed(42)

print("RustyStats loaded successfully!")

RustyStats loaded successfully!


## 1. Linear Regression (Gaussian GLM)

The simplest GLM: with a Gaussian family and identity link, it is equivalent to ordinary least squares.

In [2]:
# Generate sample data
n = 200
age = np.random.uniform(20, 60, n)
income = np.random.uniform(30, 100, n)

# True model: y = 10 + 0.5*age - 0.2*income + noise
y = 10 + 0.5 * age - 0.2 * income + np.random.randn(n) * 5

# Design matrix (always include intercept column!)
X = np.column_stack([np.ones(n), age, income])

print(f"Response y: {y.shape}")
print(f"Design matrix X: {X.shape}")
print(f"First 3 rows of X:\n{X[:3]}")

Response y: (200,)
Design matrix X: (200, 3)
First 3 rows of X:
[[ 1.         34.98160475 74.94221523]
 [ 1.         58.02857226 35.88979755]
 [ 1.         49.27975767 41.31400999]]


In [3]:
# Fit the model
result = rs.fit_glm(y, X, family="gaussian")

print(f"Converged: {result.converged}")
print(f"Iterations: {result.iterations}")
print(f"Deviance: {result.deviance:.2f}")
print("\nCoefficients:")
print(f"  Intercept: {result.params[0]:.3f} (true: 10)")
print(f"  Age:       {result.params[1]:.3f} (true: 0.5)")
print(f"  Income:    {result.params[2]:.3f} (true: -0.2)")

Converged: True
Iterations: 2
Deviance: 4840.32

Coefficients:
  Intercept: 10.772 (true: 10)
  Age:       0.501 (true: 0.5)
  Income:    -0.213 (true: -0.2)


In [4]:
# Print summary table
print(rs.summary(result, feature_names=["Intercept", "Age", "Income"]))

                                 GLM Results                                  

No. Observations:        200     Df Residuals:        197
Df Model:                  2     Deviance:      4840.3194
Converged:              True     Iterations:            2

------------------------------------------------------------------------------
Variable           Coef    Std.Err        z    P>|z|                 95% CI     
------------------------------------------------------------------------------
Intercept       10.7719     1.6798    6.412  <0.0001   [  7.4795,  14.0643]  ***
Age              0.5012     0.0298   16.820  <0.0001   [  0.4428,   0.5596]  ***
Income          -0.2129     0.0171  -12.421  <0.0001   [ -0.2465,  -0.1793]  ***
------------------------------------------------------------------------------
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
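
Under the standard definition, the deviance of a Gaussian GLM is the residual sum of squares, so dividing it by the residual degrees of freedom estimates the noise variance. The sketch below uses only the figures printed in the summary table above. `gaussian_deviance` is an illustrative helper, not a RustyStats function, and the definition is assumed rather than confirmed against the library.

```python
import math

def gaussian_deviance(y, mu):
    """Residual sum of squares: the Gaussian deviance under the standard definition."""
    return sum((yi - mi) ** 2 for yi, mi in zip(y, mu))

# Tiny check of the helper on made-up numbers: 0.25 + 0 + 1
toy_deviance = gaussian_deviance([1.0, 2.0, 4.0], [1.5, 2.0, 3.0])

# Figures copied from the summary table above
deviance = 4840.3194
df_resid = 197

dispersion = deviance / df_resid    # estimated noise variance
sigma_hat = math.sqrt(dispersion)   # estimated noise standard deviation

print(f"Toy deviance:         {toy_deviance:.2f}")
print(f"Estimated dispersion: {dispersion:.2f}")
print(f"Estimated sigma:      {sigma_hat:.2f} (data simulated with sigma = 5)")
```

The estimate lands close to the simulated noise level of 5, which is a quick sanity check on the fit.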


## 2. Poisson Regression (Claim Frequency)

Poisson regression is the workhorse of insurance pricing; it models claim counts.

In [5]:
# Simulate insurance portfolio
n = 500

# Risk factors
age = np.random.uniform(18, 70, n)
is_male = np.random.binomial(1, 0.5, n)
urban = np.random.binomial(1, 0.6, n)

# Exposure (policy duration in years)
exposure = np.random.uniform(0.25, 1.0, n)

# True model: log(λ) = -2.5 + 0.02*age + 0.3*male + 0.4*urban
log_rate = -2.5 + 0.02 * age + 0.3 * is_male + 0.4 * urban
true_rate = np.exp(log_rate)

# Generate claim counts
claims = np.random.poisson(exposure * true_rate)

print(f"Total claims: {claims.sum()}")
print(f"Average claims per policy: {claims.mean():.3f}")
print(f"Total exposure: {exposure.sum():.1f} policy-years")

Total claims: 91
Average claims per policy: 0.182
Total exposure: 311.0 policy-years


In [6]:
# Fit Poisson GLM with exposure offset
X = np.column_stack([np.ones(n), age, is_male, urban])

result = rs.fit_glm(
    claims, X, 
    family="poisson",
    offset=np.log(exposure)  # Key: log(exposure) as offset!
)

print(f"Converged: {result.converged}")
print("\nCoefficients vs True Values:")
print(f"  Intercept: {result.params[0]:.3f} (true: -2.5)")
print(f"  Age:       {result.params[1]:.4f} (true: 0.02)")
print(f"  Male:      {result.params[2]:.3f} (true: 0.3)")
print(f"  Urban:     {result.params[3]:.3f} (true: 0.4)")

Converged: True

Coefficients vs True Values:
  Intercept: -2.964 (true: -2.5)
  Age:       0.0268 (true: 0.02)
  Male:      0.299 (true: 0.3)
  Urban:     0.481 (true: 0.4)
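
To see what the offset does during fitting, here is a minimal textbook IRLS loop for a Poisson model with a log link, an intercept, and one covariate. It is a sketch for intuition only and makes no claim about how RustyStats implements its solver; `irls_poisson` and the toy data are hypothetical. The offset is added to the linear predictor but subtracted from the working response, so the coefficients describe the rate per unit of exposure.

```python
import math

def irls_poisson(x, y, offset, max_iter=25, tol=1e-10):
    """Fit log(mu) = b0 + b1*x + offset by IRLS; returns (b0, b1, iterations)."""
    # Start from the overall claim rate and no covariate effect
    b0 = math.log(sum(y) / sum(math.exp(o) for o in offset))
    b1 = 0.0
    for iteration in range(1, max_iter + 1):
        eta = [b0 + b1 * xi + oi for xi, oi in zip(x, offset)]
        mu = [math.exp(e) for e in eta]
        # Working response with the offset removed; working weights equal mu
        z = [e - oi + (yi - mi) / mi for e, oi, yi, mi in zip(eta, offset, y, mu)]
        w = mu
        # Weighted least squares of z on (1, x): solve the 2x2 normal equations
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx ** 2
        new_b0 = (swxx * swz - swx * swxz) / det
        new_b1 = (sw * swxz - swx * swz) / det
        step = max(abs(new_b0 - b0), abs(new_b1 - b1))
        b0, b1 = new_b0, new_b1
        if step < tol:
            return b0, b1, iteration
    return b0, b1, max_iter

# Noise-free toy data: expected counts for rate exp(-2.0 + 0.3*x), scaled by exposure
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
exposure = [0.5, 1.0, 0.25, 0.75, 1.0, 0.5]
y = [ex * math.exp(-2.0 + 0.3 * xi) for ex, xi in zip(exposure, x)]

b0, b1, n_iter = irls_poisson(x, y, [math.log(ex) for ex in exposure])
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, iterations = {n_iter}")
```

Because the toy counts equal their expected values, the loop recovers the rate coefficients even though every row has a different exposure.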


In [7]:
# Print relativities (for pricing)
print(rs.summary_relativities(
    result, 
    feature_names=["Base Rate", "Age (+1yr)", "Male", "Urban"]
))

                     GLM Relativities (Log Link)                      

No. Observations:        500     Deviance:   312.2552

----------------------------------------------------------------------
Variable              Coef   Relativity             95% CI (Rel)    P>|z|
----------------------------------------------------------------------
Base Rate          -2.9640       0.0516     [  0.0257,   0.1038]  <0.0001 ***
Age (+1yr)          0.0268       1.0271     [  1.0154,   1.0390]  <0.0001 ***
Male                0.2995       1.3492     [  0.9694,   1.8776]   0.0757 .
Urban               0.4808       1.6174     [  1.1352,   2.3045]   0.0078 **
----------------------------------------------------------------------
Relativity = exp(Coef). Values > 1 increase the response.


### Interpreting Relativities

For Poisson with log link:
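- **Base Rate**: Expected claims per year for the baseline profile (age=0, female, rural). Age 0 lies outside the simulated 18-70 range, so this value is an extrapolation rather than a rate for any actual policyholder.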
- **Age (+1yr)**: Each year of age multiplies the rate by this factor
- **Male**: Being male multiplies the rate by this factor vs female
- **Urban**: Urban location multiplies the rate by this factor vs rural
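
The arithmetic behind these interpretations can be reproduced with the standard library alone. The coefficients below are copied from the relativity table above; because they are rounded to four decimals, the last printed digit can differ slightly from the table.

```python
import math

# Coefficients copied from the relativity table above (log scale)
coefs = {"Age (+1yr)": 0.0268, "Male": 0.2995, "Urban": 0.4808}

# Relativity = exp(coefficient)
relativities = {name: math.exp(beta) for name, beta in coefs.items()}
for name, rel in relativities.items():
    print(f"{name:<12} {rel:.4f}")

# Per-year relativities compound: a 10-year age gap scales the rate by exp(10 * beta_age)
age_gap_10 = math.exp(10 * coefs["Age (+1yr)"])
print(f"40-year-old vs 30-year-old: {age_gap_10:.3f}x")

# Relativities multiply across factors: urban male vs rural female of the same age
urban_male = relativities["Male"] * relativities["Urban"]
print(f"Urban male vs rural female: {urban_male:.3f}x")
```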

In [8]:
# Example: Calculate rate for a 40-year-old male in urban area
base_rate = np.exp(result.params[0])
age_factor = np.exp(result.params[1] * 40)
male_factor = np.exp(result.params[2])
urban_factor = np.exp(result.params[3])

predicted_rate = base_rate * age_factor * male_factor * urban_factor
print(f"Predicted annual claim rate: {predicted_rate:.4f}")
print(f"Expected claims per 1000 policies: {predicted_rate * 1000:.1f}")

Predicted annual claim rate: 0.3285
Expected claims per 1000 policies: 328.5


## 3. Gamma Regression (Claim Severity)

Gamma regression models claim amounts, which are always positive and typically right-skewed.

In [9]:
# Simulate claim severities (only for policies with claims)
n_claims = 300

# Risk factors for severity
vehicle_age = np.random.uniform(0, 15, n_claims)
is_luxury = np.random.binomial(1, 0.2, n_claims)

# True model: log(μ) = 7.5 + 0.05*vehicle_age + 0.8*luxury
log_mean = 7.5 + 0.05 * vehicle_age + 0.8 * is_luxury
mean_severity = np.exp(log_mean)

# Gamma with shape=2, so CV = 1/sqrt(shape) ≈ 0.71
shape = 2.0
severity = np.random.gamma(shape, mean_severity / shape)

print(f"Average severity: ${severity.mean():,.0f}")
print(f"Median severity: ${np.median(severity):,.0f}")
print(f"Max severity: ${severity.max():,.0f}")

Average severity: $3,462
Median severity: $2,680
Max severity: $18,705
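
The simulation above sets the Gamma scale to `mean / shape`, which makes the distribution's mean equal the target mean while the coefficient of variation depends only on the shape. A small standard-library check of that parameterization follows; the target mean of 3000 is a hypothetical value chosen for illustration.

```python
import math
import random

random.seed(0)

shape = 2.0
mean = 3000.0            # hypothetical target mean severity
scale = mean / shape     # scale chosen so that shape * scale == mean

# Theoretical moments of a Gamma(shape, scale) distribution
theoretical_mean = shape * scale
theoretical_cv = 1 / math.sqrt(shape)   # depends only on the shape parameter

# Quick simulation check with the standard library's gamma sampler
draws = [random.gammavariate(shape, scale) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)
sample_sd = math.sqrt(sum((d - sample_mean) ** 2 for d in draws) / (len(draws) - 1))
sample_cv = sample_sd / sample_mean

print(f"Theoretical mean {theoretical_mean:.0f}, sample mean {sample_mean:.0f}")
print(f"Theoretical CV {theoretical_cv:.3f}, sample CV {sample_cv:.3f}")
```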


In [10]:
# Fit Gamma GLM
X = np.column_stack([np.ones(n_claims), vehicle_age, is_luxury])

result = rs.fit_glm(severity, X, family="gamma")

print(rs.summary_relativities(
    result,
    feature_names=["Base Severity", "Vehicle Age (+1yr)", "Luxury"]
))

                     GLM Relativities (Log Link)                      

No. Observations:        300     Deviance:   167.2562

----------------------------------------------------------------------
Variable              Coef   Relativity             95% CI (Rel)    P>|z|
----------------------------------------------------------------------
Base Severity       7.7198    2252.5266   [1888.0171, 2687.4100]  <0.0001 ***
Vehicle Age (+1     0.0314       1.0319     [  1.0119,   1.0523]   0.0017 **
Luxury              0.6719       1.9579     [  1.5874,   2.4149]  <0.0001 ***
----------------------------------------------------------------------
Relativity = exp(Coef). Values > 1 increase the response.


## 4. Logistic Regression (Claim Probability)

Logistic regression models binary outcomes: will a claim occur or not?

In [11]:
# Simulate claim occurrence
n = 400

years_licensed = np.random.uniform(0, 30, n)
prior_claims = np.random.poisson(0.3, n)

# True model: logit(p) = -1.5 - 0.05*years + 0.5*prior_claims
logit_p = -1.5 - 0.05 * years_licensed + 0.5 * prior_claims
prob = 1 / (1 + np.exp(-logit_p))

# Generate binary outcomes
had_claim = np.random.binomial(1, prob)

print(f"Claim rate: {had_claim.mean():.1%}")

Claim rate: 11.5%


In [12]:
# Fit Binomial GLM
X = np.column_stack([np.ones(n), years_licensed, prior_claims])

result = rs.fit_glm(had_claim.astype(float), X, family="binomial")

print(rs.summary(result, feature_names=["Intercept", "Years Licensed", "Prior Claims"]))

# Odds ratios
print("\nOdds Ratios:")
print(f"  Years Licensed: {np.exp(result.params[1]):.3f} (per year)")
print(f"  Prior Claims:   {np.exp(result.params[2]):.3f} (per claim)")

                                 GLM Results                                  

No. Observations:        400     Df Residuals:        397
Df Model:                  2     Deviance:       273.0019
Converged:              True     Iterations:            5

------------------------------------------------------------------------------
Variable           Coef    Std.Err        z    P>|z|                 95% CI     
------------------------------------------------------------------------------
Intercept       -1.1482     0.2436   -4.715  <0.0001   [ -1.6256,  -0.6709]  ***
Years Licens    -0.0670     0.0165   -4.056  <0.0001   [ -0.0993,  -0.0346]  ***
Prior Claims    -0.0196     0.2511   -0.078   0.9378   [ -0.5118,   0.4726]     
------------------------------------------------------------------------------
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Odds Ratios:
  Years Licensed: 0.935 (per year)
  Prior Claims:   0.981 (per claim)
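
An odds ratio multiplies the odds of a claim, not the probability. The sketch below converts the fitted coefficients (copied from the summary table above) into probabilities for a hypothetical driver and confirms that one extra year licensed scales the odds by the printed odds ratio. `inv_logit` is an illustrative helper, not part of RustyStats.

```python
import math

def inv_logit(eta):
    """Map a linear predictor on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Coefficients copied from the summary table above
intercept, b_years, b_prior = -1.1482, -0.0670, -0.0196

# Odds ratio for one additional year licensed
or_years = math.exp(b_years)

# Hypothetical driver with no prior claims, at 10 and at 11 years licensed
p_10 = inv_logit(intercept + b_years * 10 + b_prior * 0)
p_11 = inv_logit(intercept + b_years * 11 + b_prior * 0)

odds_10 = p_10 / (1 - p_10)
odds_11 = p_11 / (1 - p_11)

print(f"Odds ratio per year:     {or_years:.3f}")
print(f"P(claim | 10 years):     {p_10:.3f}")
print(f"P(claim | 11 years):     {p_11:.3f}")
print(f"Ratio of odds (11 / 10): {odds_11 / odds_10:.3f}")
```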


## 5. Making Predictions

Use fitted models to predict on new data.

In [13]:
# Simulate a fresh portfolio and fit a Poisson model (age and sex only) for predictions
n = 500
age = np.random.uniform(18, 70, n)
is_male = np.random.binomial(1, 0.5, n)
exposure = np.random.uniform(0.25, 1.0, n)
claims = np.random.poisson(exposure * np.exp(-2 + 0.02 * age + 0.3 * is_male))

X = np.column_stack([np.ones(n), age, is_male])
result = rs.fit_glm(claims, X, family="poisson", offset=np.log(exposure))

print("Model fitted. Coefficients:")
print(f"  Intercept: {result.params[0]:.3f}")
print(f"  Age:       {result.params[1]:.4f}")
print(f"  Male:      {result.params[2]:.3f}")

Model fitted. Coefficients:
  Intercept: -1.985
  Age:       0.0162
  Male:      0.233


In [14]:
# Predict for new policyholders
new_policyholders = np.array([
    [1, 25, 0],  # 25-year-old female
    [1, 25, 1],  # 25-year-old male
    [1, 45, 0],  # 45-year-old female
    [1, 45, 1],  # 45-year-old male
    [1, 65, 0],  # 65-year-old female
    [1, 65, 1],  # 65-year-old male
])

# Predict annual claim rate (1 year exposure)
new_exposure = np.ones(6)
predicted_rates = rs.glm.predict(
    result, 
    new_policyholders, 
    link="log",
    offset=np.log(new_exposure)
)

print("Predicted Annual Claim Rates:")
print(f"{'Profile':<25} {'Rate':>10} {'Per 1000':>10}")
print("-" * 47)
labels = [
    "25-year-old female", "25-year-old male",
    "45-year-old female", "45-year-old male",
    "65-year-old female", "65-year-old male",
]
for label, rate in zip(labels, predicted_rates):
    print(f"{label:<25} {rate:>10.4f} {rate*1000:>10.1f}")

Predicted Annual Claim Rates:
Profile                         Rate   Per 1000
-----------------------------------------------
25-year-old female            0.2058      205.8
25-year-old male              0.2597      259.7
45-year-old female            0.2842      284.2
45-year-old male              0.3588      358.8
65-year-old female            0.3926      392.6
65-year-old male              0.4956      495.6
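
For a log link, a prediction with an offset is presumably the exponential of the linear predictor plus the offset. The sketch below reproduces the first row of the table by hand under that assumption, using the rounded coefficients printed earlier, and shows how a half-year exposure halves the expected count. `predict_log_link` is a hypothetical helper and not the RustyStats implementation.

```python
import math

def predict_log_link(params, row, exposure=1.0):
    """Expected count under a log link: exposure * exp(row . params)."""
    eta = sum(b * x for b, x in zip(params, row)) + math.log(exposure)
    return math.exp(eta)

# Rounded coefficients copied from the fit above: intercept, age, male
params = [-1.985, 0.0162, 0.233]

full_year = predict_log_link(params, [1, 25, 0])                # 25-year-old female
half_year = predict_log_link(params, [1, 25, 0], exposure=0.5)  # same profile, 6 months

print(f"Full-year expected claims: {full_year:.4f}")
print(f"Half-year expected claims: {half_year:.4f}")
```

Small differences from the table in the last digits come from rounding the coefficients.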


## 6. Using Weights

Weights are useful for:
- Grouped/aggregated data
- Known variance differences
- Importance sampling

In [15]:
# Example: Grouped data where each row represents multiple observations
n_groups = 50

# Each group has different number of policies
group_size = np.random.randint(10, 100, n_groups)

# Average values for each group
age_group = np.random.uniform(25, 55, n_groups)
y_group = 100 + 2 * age_group + np.random.randn(n_groups) * 20 / np.sqrt(group_size)

X = np.column_stack([np.ones(n_groups), age_group])

# Fit with weights = group size
result_weighted = rs.fit_glm(y_group, X, family="gaussian", weights=group_size)

# Compare to unweighted
result_unweighted = rs.fit_glm(y_group, X, family="gaussian")

print("Weighted vs Unweighted Regression:")
print(f"{'Coefficient':<15} {'Weighted':>12} {'Unweighted':>12}")
print("-" * 41)
print(f"{'Intercept':<15} {result_weighted.params[0]:>12.3f} {result_unweighted.params[0]:>12.3f}")
print(f"{'Age':<15} {result_weighted.params[1]:>12.3f} {result_unweighted.params[1]:>12.3f}")

Weighted vs Unweighted Regression:
Coefficient         Weighted   Unweighted
-----------------------------------------
Intercept            100.827      100.623
Age                    1.972        1.981
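
One way to read group-size weights: for a Gaussian model, the weighted coefficients equal those from an unweighted fit in which each row is repeated as many times as its weight. The sketch below checks this for a straight-line fit with made-up numbers; `weighted_line_fit` is an illustrative closed-form helper, not a RustyStats function. Only the point estimates are compared here, not the standard errors.

```python
def weighted_line_fit(x, y, w):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Three grouped rows (hypothetical numbers) with group sizes as weights
x = [30.0, 40.0, 50.0]
y = [161.0, 178.0, 203.0]
sizes = [2, 1, 3]

a_w, b_w = weighted_line_fit(x, y, sizes)

# The same fit on expanded data: each row repeated group-size times, unit weights
x_rep = [xi for xi, k in zip(x, sizes) for _ in range(k)]
y_rep = [yi for yi, k in zip(y, sizes) for _ in range(k)]
a_r, b_r = weighted_line_fit(x_rep, y_rep, [1] * len(x_rep))

print(f"Weighted:   intercept {a_w:.4f}, slope {b_w:.4f}")
print(f"Replicated: intercept {a_r:.4f}, slope {b_r:.4f}")
```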


## 7. Statistical Inference

Access p-values, confidence intervals, and significance tests.

In [16]:
# Create a model with clear significant and non-significant effects
n = 300
x1 = np.random.randn(n)  # Strong effect
x2 = np.random.randn(n)  # Weak effect
x3 = np.random.randn(n)  # No effect

y = 5 + 2.0 * x1 + 0.1 * x2 + 0.0 * x3 + np.random.randn(n) * 0.5

X = np.column_stack([np.ones(n), x1, x2, x3])
result = rs.fit_glm(y, X, family="gaussian")

# Access individual inference components
print("Coefficients:", result.params)
print("Std Errors:  ", result.bse())
print("z-values:    ", result.tvalues())
print("p-values:    ", result.pvalues())
print("Significance:", result.significance_codes())

Coefficients: [4.98018042 1.95948781 0.09984191 0.02145732]
Std Errors:   [0.02962501 0.0295145  0.02954883 0.02792049]
z-values:     [168.10732706  66.39068405   3.37887867   0.76851526]
p-values:     [0.         0.         0.00072782 0.44218113]
Significance: ['***', '***', '***', '']


In [17]:
# Confidence intervals
ci = result.conf_int(alpha=0.05)  # 95% CI

print("\n95% Confidence Intervals:")
print(f"{'Variable':<12} {'Lower':>10} {'Estimate':>10} {'Upper':>10}")
print("-" * 44)
names = ["Intercept", "x1 (strong)", "x2 (weak)", "x3 (none)"]
for i, name in enumerate(names):
    print(f"{name:<12} {ci[i, 0]:>10.3f} {result.params[i]:>10.3f} {ci[i, 1]:>10.3f}")


95% Confidence Intervals:
Variable          Lower   Estimate      Upper
--------------------------------------------
Intercept         4.922      4.980      5.038
x1 (strong)       1.902      1.959      2.017
x2 (weak)         0.042      0.100      0.158
x3 (none)        -0.033      0.021      0.076
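
The inference outputs are linked by simple formulas if one assumes normal-based (Wald) statistics, as the `z` and `P>|z|` column headers suggest. The sketch below recomputes the z-value, p-value, and 95% interval for `x2` from its printed estimate and standard error. `wald_inference` is an illustrative helper, not part of RustyStats.

```python
import math
from statistics import NormalDist

def wald_inference(coef, se, alpha=0.05):
    """Return (z, two-sided p-value, lower, upper) under a normal approximation."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2))         # equals 2 * (1 - Phi(|z|))
    crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    return z, p, coef - crit * se, coef + crit * se

# Estimate and standard error for x2, copied from the output above
z, p, lower, upper = wald_inference(0.09984191, 0.02954883)

print(f"z = {z:.3f}, p = {p:.5f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

The results agree with the printed values for `x2` to the displayed precision, which supports the normal-approximation reading.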


In [18]:
# Full summary table
print(rs.summary(result, feature_names=["Intercept", "X1 (strong)", "X2 (weak)", "X3 (none)"]))

                                 GLM Results                                  

No. Observations:        300     Df Residuals:        296
Df Model:                  3     Deviance:        77.1705
Converged:              True     Iterations:            2

------------------------------------------------------------------------------
Variable           Coef    Std.Err        z    P>|z|                 95% CI     
------------------------------------------------------------------------------
Intercept        4.9802     0.0296  168.107  <0.0001   [  4.9221,   5.0382]  ***
X1 (strong)      1.9595     0.0295   66.391  <0.0001   [  1.9016,   2.0173]  ***
X2 (weak)        0.0998     0.0295    3.379   0.0007   [  0.0419,   0.1578]  ***
X3 (none)        0.0215     0.0279    0.769   0.4422   [ -0.0333,   0.0762]     
------------------------------------------------------------------------------
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


## 8. Object-Oriented API

RustyStats also offers a statsmodels-like, object-oriented interface.

In [19]:
# Rebuild the Poisson design matrix from section 5: X was overwritten
# in section 7 and now has 300 rows, while claims has 500 entries
X_poisson = np.column_stack([np.ones(len(claims)), age, is_male])

# Create model object
model = rs.GLM(
    endog=claims,     # Response
    exog=X_poisson,   # Design matrix
    family="poisson",
    offset=np.log(exposure)
)

# Fit with custom options
result = model.fit(max_iter=50, tol=1e-10)

print(f"Model: {model.family} family")
print(f"Observations: {model.nobs}")
print(f"Parameters: {model.df_model + 1}")
print(f"Residual DF: {model.df_resid}")
print(f"\nConverged: {result.converged} in {result.iterations} iterations")

---

## Summary

### Key Functions

| Function | Description |
|----------|-------------|
| `rs.fit_glm(y, X, family, ...)` | Fit a GLM |
| `rs.GLM(endog, exog, ...)` | OOP interface |
| `rs.summary(result, ...)` | Regression table |
| `rs.summary_relativities(result, ...)` | Factor table |
| `rs.glm.predict(result, X_new, ...)` | Predictions |

### Families

| Family | Use Case | Link |
|--------|----------|------|
| `gaussian` | Continuous data | identity |
| `poisson` | Count data (frequency) | log |
| `gamma` | Positive continuous (severity) | log |
| `binomial` | Binary/proportion | logit |

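### Result Attributes and Methods

| Attribute / Method | Description |
|--------------------|-------------|
| `result.params` | Coefficients |
| `result.fittedvalues` | Fitted μ values |
| `result.deviance` | Model deviance |
| `result.converged` | Whether the fit converged |
| `result.iterations` | Number of iterations used |
| `result.bse()` | Standard errors |
| `result.tvalues()` | Test statistics (coefficient / standard error) |
| `result.pvalues()` | P-values |
| `result.conf_int()` | Confidence intervals |
| `result.significance_codes()` | Significance stars |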