In this project, I developed a baseline insurance claim prediction model using single-feature logistic regression and identified the feature that predicts claims with the highest accuracy. The resulting model is highly interpretable and aligned with business constraints requiring simplicity and ease of deployment.

# Project Background

Insurance companies invest significant time and resources into pricing optimization and accurately estimating the likelihood that customers will file insurance claims. In many countries, car insurance is a legal requirement for driving on public roads, making the market both large and highly competitive.

Source: Accenture – Machine Learning in Insurance (https://www.accenture.com/_acnmedia/pdf-84/accenture-machine-leaning-insurance.pdf)

## Problem Statement

On the Road car insurance has requested support in building a machine learning model to predict whether a customer will make a claim during their policy period.

Due to limited expertise and infrastructure for deploying and monitoring complex machine learning systems, the company has requested a simple and interpretable solution. Specifically, the goal is to identify the single feature that produces the best-performing predictive model, measured using accuracy, to enable an easy first step into production.

In [2]:
# Import required modules
import pandas as pd
import numpy as np
from statsmodels.formula.api import logit

In [3]:
# Read in dataset
cars = pd.read_csv("car_insurance.csv")

# Check for missing values
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   10000 non-null  int64  
 1   age                  10000 non-null  int64  
 2   gender               10000 non-null  int64  
 3   driving_experience   10000 non-null  object 
 4   education            10000 non-null  object 
 5   income               10000 non-null  object 
 6   credit_score         9018 non-null   float64
 7   vehicle_ownership    10000 non-null  float64
 8   vehicle_year         10000 non-null  object 
 9   married              10000 non-null  float64
 10  children             10000 non-null  float64
 11  postal_code          10000 non-null  int64  
 12  annual_mileage       9043 non-null   float64
 13  vehicle_type         10000 non-null  object 
 14  speeding_violations  10000 non-null  int64  
 15  duis                 10000 non-null  
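
The non-null counts above show that `credit_score` (9018 of 10000) and `annual_mileage` (9043 of 10000) contain missing values. A minimal sketch of counting missing values per column directly, assuming a small hypothetical DataFrame (`toy`, with made-up values) in place of the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data; values are made up for illustration.
toy = pd.DataFrame({
    "credit_score": [0.62, np.nan, 0.48, 0.71],
    "annual_mileage": [12000.0, 11000.0, np.nan, np.nan],
    "outcome": [0, 1, 0, 1],
})

# Count missing values per column and keep only the columns that have any.
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

This gives the number of gaps per column without reading them off the `info()` table.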

In [4]:
# Fill missing values with the column mean
# (assign the result back rather than calling fillna with inplace=True on a column selection)
cars["credit_score"] = cars["credit_score"].fillna(cars["credit_score"].mean())
cars["annual_mileage"] = cars["annual_mileage"].fillna(cars["annual_mileage"].mean())
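
As an alternative sketch of the mean imputation step, both columns can be filled in one call by passing a column-to-value mapping to `DataFrame.fillna`. The DataFrame `toy` and its values below are hypothetical and only illustrate the pattern:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the two columns with gaps.
toy = pd.DataFrame({
    "credit_score": [0.4, np.nan, 0.6],
    "annual_mileage": [10000.0, 14000.0, np.nan],
})

# Compute the per-column means (NaN values are skipped by default).
column_means = toy[["credit_score", "annual_mileage"]].mean()

# Fill each column with its own mean; fillna returns a new DataFrame.
toy_filled = toy.fillna(column_means.to_dict())
print(toy_filled)
```

Because the result is a new DataFrame, the original stays unchanged unless the result is assigned back.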


In [5]:
# Empty list to store model results
models = []

In [6]:
# Feature columns
features = cars.drop(columns=["id", "outcome"]).columns

# Loop through features
for col in features:
    # Create a model
    model = logit(f"outcome ~ {col}", data=cars).fit()
    # Add each model to the models list
    models.append(model)

Optimization terminated successfully.
         Current function value: 0.511794
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.615951
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.467092
         Iterations 8
Optimization terminated successfully.
         Current function value: 0.603742
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.531499
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.572557
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.552412
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.572668
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.586659
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.595431
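
Each fitted model maps one feature to a claim probability through the logistic function. As an illustration only, a numeric single-feature prediction can be sketched as follows. The function names and the `intercept` and `slope` values are hypothetical and are not taken from the fitted models:

```python
import math

def claim_probability(x, intercept, slope):
    """Logistic function applied to a linear score of one numeric feature."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))

def predict_claim(x, intercept, slope, threshold=0.5):
    """Return 1 (claim) when the probability exceeds the threshold, else 0."""
    return int(claim_probability(x, intercept, slope) > threshold)

# Hypothetical coefficients chosen only to make the example concrete.
intercept, slope = 1.0, -0.5

for x in (0, 2, 4):
    probability = claim_probability(x, intercept, slope)
    print(x, round(probability, 3), predict_claim(x, intercept, slope))
```

Categorical columns such as `driving_experience` are dummy-encoded by the formula interface, so those models have one coefficient per non-reference category rather than a single slope.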
  

In [7]:
# Empty list to store accuracies
accuracies = []

# Loop through models
for model in models:
    # Compute the confusion matrix (rows: actual, columns: predicted)
    conf_matrix = model.pred_table()
    # True negatives
    tn = conf_matrix[0,0]
    # True positives
    tp = conf_matrix[1,1]
    # False negatives
    fn = conf_matrix[1,0]
    # False positives
    fp = conf_matrix[0,1]
    # Compute accuracy
    acc = (tn + tp) / (tn + fn + fp + tp)
    accuracies.append(acc)
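
The accuracy calculation can also be written compactly, since the correct predictions sit on the diagonal of the confusion matrix. A minimal sketch, assuming a hypothetical 2x2 matrix with made-up counts in the same layout as `pred_table()`:

```python
import numpy as np

# Hypothetical 2x2 confusion matrix laid out like pred_table():
# rows are actual classes, columns are predicted classes.
conf_matrix = np.array([[50.0, 10.0],
                        [5.0, 35.0]])

# Correct predictions sit on the diagonal (true negatives and true positives).
accuracy = np.trace(conf_matrix) / conf_matrix.sum()
print(accuracy)
```

This is equivalent to `(tn + tp) / (tn + fn + fp + tp)` for a binary outcome.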

In [8]:
# Find the feature with the largest accuracy
best_feature = features[accuracies.index(max(accuracies))]
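
An equivalent way to make this selection is to pair each feature with its accuracy and take the maximum by accuracy. The feature names and accuracy values in this sketch are hypothetical:

```python
# Hypothetical feature names and accuracies, for illustration only.
features = ["feature_a", "feature_b", "feature_c"]
accuracies = [0.70, 0.78, 0.74]

# Pair each feature with its accuracy, then take the pair with the highest accuracy.
best_feature, best_accuracy = max(zip(features, accuracies), key=lambda pair: pair[1])
print(best_feature, best_accuracy)
```

Like `accuracies.index(max(accuracies))`, this returns the first feature when several share the top accuracy.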

In [9]:
# Create best_feature_df
best_feature_df = pd.DataFrame({"best_feature": best_feature,
                                "best_accuracy": max(accuracies)},
                                index=[0])
best_feature_df

|   | best_feature | best_accuracy |
|---|--------------|---------------|
| 0 | driving_experience | 0.7771 |


This directly answers the business question:

**Which single feature should we deploy first in production?**

**Driving experience** is the most predictive single feature, with an accuracy of 0.7771 (about 77.7%) measured on the same data used to fit the model. A plausible explanation is that it reflects both driver skill and real-world risk exposure: drivers with fewer years of experience may be more likely to make errors, be involved in accidents, and file insurance claims, while more experienced drivers tend to drive more cautiously and manage risk better. Driving experience alone therefore offers intuitive predictive power and is well suited to a simple, interpretable insurance risk model.

This approach provides a strong baseline and foundation for future model expansion once infrastructure and expertise mature.