# Car Insurance Claim Prediction Project

## Introduction

Insurance companies invest significant time and resources into optimizing pricing models and accurately estimating the probability that customers will make a claim. In many countries, car insurance is legally required to drive on public roads, which creates a massive market for such services. 

In this project, **On the Road** car insurance has tasked us with developing a model that predicts whether a customer will file a claim during the policy period. Because they have limited expertise and infrastructure for deploying and monitoring machine learning models, they have asked for a **simple solution**: identify the single most predictive feature in their customer dataset, which they can use to build an initial model.

Our goal is to:
1. **Analyze the data** provided in the `car_insurance.csv` file.
2. **Identify the single best feature** that predicts whether a customer will file a claim (as indicated by the `outcome` column).
3. Measure the performance of this feature using **accuracy** as the evaluation metric.
4. Store the result in a DataFrame named `best_feature_df`, containing the following columns:
    - `best_feature`: the name of the most predictive feature.
    - `best_accuracy`: the corresponding accuracy score of the feature.

This approach will enable **On the Road** to start with a simple model in production, ensuring that they can deploy and monitor it effectively while minimizing complexity.


In [31]:
import pandas as pd
from statsmodels.formula.api import logit

In [2]:
df = pd.read_csv("Dataset/car_insurance.csv")

In [3]:
df.head()

|   | id | age | gender | driving_experience | education | income | credit_score | vehicle_ownership | vehicle_year | married | children | postal_code | annual_mileage | vehicle_type | speeding_violations | duis | past_accidents | outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 569520 | 3 | 0 | 0-9y | high school | upper class | 0.629027 | 1.0 | after 2015 | 0.0 | 1.0 | 10238 | 12000.0 | sedan | 0 | 0 | 0 | 0.0 |
| 1 | 750365 | 0 | 1 | 0-9y | none | poverty | 0.357757 | 0.0 | before 2015 | 0.0 | 0.0 | 10238 | 16000.0 | sedan | 0 | 0 | 0 | 1.0 |
| 2 | 199901 | 0 | 0 | 0-9y | high school | working class | 0.493146 | 1.0 | before 2015 | 0.0 | 0.0 | 10238 | 11000.0 | sedan | 0 | 0 | 0 | 0.0 |
| 3 | 478866 | 0 | 1 | 0-9y | university | working class | 0.206013 | 1.0 | before 2015 | 0.0 | 1.0 | 32765 | 11000.0 | sedan | 0 | 0 | 0 | 0.0 |
| 4 | 731664 | 1 | 1 | 10-19y | none | working class | 0.388366 | 1.0 | before 2015 | 0.0 | 0.0 | 32765 | 12000.0 | sedan | 2 | 0 | 1 | 1.0 |


In [4]:
df.describe()

|   | id | age | gender | credit_score | vehicle_ownership | married | children | postal_code | annual_mileage | speeding_violations | duis | past_accidents | outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 9018.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 9043.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 500521.9068 | 1.4895 | 0.499 | 0.515813 | 0.697 | 0.4982 | 0.6888 | 19864.5484 | 11697.003207 | 1.4829 | 0.2392 | 1.0563 | 0.3133 |
| std | 290030.768758 | 1.025278 | 0.500024 | 0.137688 | 0.459578 | 0.500022 | 0.463008 | 18915.613855 | 2818.434528 | 2.241966 | 0.55499 | 1.652454 | 0.463858 |
| min | 101.0 | 0.0 | 0.0 | 0.053358 | 0.0 | 0.0 | 0.0 | 10238.0 | 2000.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 249638.5 | 1.0 | 0.0 | 0.417191 | 0.0 | 0.0 | 0.0 | 10238.0 | 10000.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 501777.0 | 1.0 | 0.0 | 0.525033 | 1.0 | 0.0 | 1.0 | 10238.0 | 12000.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 753974.5 | 2.0 | 1.0 | 0.618312 | 1.0 | 1.0 | 1.0 | 32765.0 | 14000.0 | 2.0 | 0.0 | 2.0 | 1.0 |
| max | 999976.0 | 3.0 | 1.0 | 0.960819 | 1.0 | 1.0 | 1.0 | 92101.0 | 22000.0 | 22.0 | 6.0 | 15.0 | 1.0 |


In [5]:
df.isna().sum()

id                       0
age                      0
gender                   0
driving_experience       0
education                0
income                   0
credit_score           982
vehicle_ownership        0
vehicle_year             0
married                  0
children                 0
postal_code              0
annual_mileage         957
vehicle_type             0
speeding_violations      0
duis                     0
past_accidents           0
outcome                  0
dtype: int64

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   10000 non-null  int64  
 1   age                  10000 non-null  int64  
 2   gender               10000 non-null  int64  
 3   driving_experience   10000 non-null  object 
 4   education            10000 non-null  object 
 5   income               10000 non-null  object 
 6   credit_score         9018 non-null   float64
 7   vehicle_ownership    10000 non-null  float64
 8   vehicle_year         10000 non-null  object 
 9   married              10000 non-null  float64
 10  children             10000 non-null  float64
 11  postal_code          10000 non-null  int64  
 12  annual_mileage       9043 non-null   float64
 13  vehicle_type         10000 non-null  object 
 14  speeding_violations  10000 non-null  int64  
 15  duis                 10000 non-null  

In [7]:
df.shape

(10000, 18)

There are two columns with missing values: `credit_score` and `annual_mileage`.
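To judge how serious the gaps are, it can help to look at the share of missing rows per column rather than the raw counts. The sketch below runs on a small made-up frame (`toy` and its values are illustrative and not taken from `car_insurance.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "credit_score": [0.63, np.nan, 0.49, 0.21],
    "annual_mileage": [12000.0, 16000.0, np.nan, np.nan],
    "outcome": [0.0, 1.0, 0.0, 0.0],
})

# Fraction of missing values per column, keeping only columns with gaps
missing_share = toy.isna().mean()
missing_share = missing_share[missing_share > 0]
print(missing_share)
```

Applied to the counts reported by `df.isna().sum()`, this corresponds to roughly 9.8% missing for `credit_score` (982 of 10,000) and 9.6% for `annual_mileage` (957 of 10,000).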

In [8]:
# Count the missing values in credit_score
df["credit_score"].isna().sum()

982

In [9]:
# Calculate the mean of the credit_score column
mean_credit_score = df["credit_score"].mean()
mean_credit_score

0.515812809602791

In [10]:
# Replace missing values with the mean, assigning the result back
# (calling fillna with inplace=True on a column selection is deprecated)
df["credit_score"] = df["credit_score"].fillna(mean_credit_score)


In [11]:
df["credit_score"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 10000 entries, 0 to 9999
Series name: credit_score
Non-Null Count  Dtype  
--------------  -----  
10000 non-null  float64
dtypes: float64(1)
memory usage: 78.3 KB


In [12]:
df["annual_mileage"].isna().sum()

957

In [13]:
# Calculate the mean of the annual_mileage column
mean_annual_mileage = df["annual_mileage"].mean()
mean_annual_mileage

11697.003206900365

In [14]:
# Fill missing values with the mean (note: the result is not assigned back to df)
df["annual_mileage"].fillna(mean_annual_mileage)

0       12000.000000
1       16000.000000
2       11000.000000
3       11000.000000
4       12000.000000
            ...     
9995    16000.000000
9996    11697.003207
9997    14000.000000
9998    13000.000000
9999    13000.000000
Name: annual_mileage, Length: 10000, dtype: float64

In [15]:
df["annual_mileage"].isna().sum()

957
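The count is still 957 because `fillna` returns a new Series and leaves `df` unchanged unless the result is assigned back. A minimal sketch of the difference, using a made-up three-row column (`toy` and `mean_mileage` are illustrative names):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"annual_mileage": [12000.0, np.nan, 14000.0]})
mean_mileage = toy["annual_mileage"].mean()  # NaN is skipped by default

# Without assignment the original column is unchanged
toy["annual_mileage"].fillna(mean_mileage)
missing_before = int(toy["annual_mileage"].isna().sum())

# Assigning the result back actually fills the gap
toy["annual_mileage"] = toy["annual_mileage"].fillna(mean_mileage)
missing_after = int(toy["annual_mileage"].isna().sum())

print(missing_before, missing_after)
```

Since `annual_mileage` keeps its missing values in this notebook, the model fitted on that column is most likely estimated on the 9,043 complete rows only: the statsmodels formula interface drops rows with missing values by default.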

In [38]:
# Empty list to store model results
models = []

In [41]:
# Feature columns
features = df.drop(columns=["id", "outcome"]).columns
features

Index(['age', 'gender', 'driving_experience', 'education', 'income',
       'credit_score', 'vehicle_ownership', 'vehicle_year', 'married',
       'children', 'postal_code', 'annual_mileage', 'vehicle_type',
       'speeding_violations', 'duis', 'past_accidents'],
      dtype='object')

In [43]:
# Loop through features
for col in features:
    # Create a model
    model = logit(f"outcome ~ {col}", data=df).fit()
    # Add each model to the models list
    models.append(model)

Optimization terminated successfully.
         Current function value: 0.511794
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.615951
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.467092
         Iterations 8
Optimization terminated successfully.
         Current function value: 0.603742
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.531499
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.572557
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.552412
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.572668
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.586659
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.595431
  

In [44]:
# Empty list to store accuracies
accuracies = []

In [45]:
# Loop through the fitted models
for model in models:
    # Confusion matrix: rows are observed classes, columns are predicted classes
    conf_matrix = model.pred_table()
    # True negatives and true positives (diagonal)
    tn = conf_matrix[0, 0]
    tp = conf_matrix[1, 1]
    # False negatives and false positives (off-diagonal)
    fn = conf_matrix[1, 0]
    fp = conf_matrix[0, 1]
    # Accuracy is the share of correct predictions
    acc = (tn + tp) / (tn + fn + fp + tp)
    accuracies.append(acc)
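`pred_table()` returns a 2x2 table in which rows are the observed classes and columns are the predicted classes, with predictions cut at a default threshold of 0.5. To make the accuracy arithmetic concrete, the sketch below builds such a table by hand with NumPy. `confusion_table` is a hypothetical helper written for illustration, not part of statsmodels, and the labels and probabilities are made up:

```python
import numpy as np

def confusion_table(y_true, y_prob, threshold=0.5):
    """Hypothetical helper: rows are observed classes, columns are predicted."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    table = np.zeros((2, 2))
    for observed, predicted in zip(y_true, y_pred):
        table[observed, predicted] += 1
    return table

# Made-up labels and predicted probabilities
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.7, 0.8, 0.4, 0.2, 0.9]

table = confusion_table(y_true, y_prob)
# Accuracy is the diagonal (correct predictions) divided by the total
accuracy = np.trace(table) / table.sum()
print(table)
print(accuracy)
```

Here four of the six predictions fall on the diagonal, so the accuracy is 4/6.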

In [47]:
accuracies

[0.7747,
 0.6867,
 0.7771,
 0.6867,
 0.7425,
 0.7054,
 0.7351,
 0.6867,
 0.6867,
 0.6867,
 0.6867,
 0.6933539754506248,
 0.6867,
 0.6867,
 0.6867,
 0.6867]
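Many entries equal 0.6867, which matches the accuracy of always predicting "no claim": `df.describe()` reports a mean `outcome` of 0.3133, so 68.67% of customers did not file a claim. The sketch below reproduces that baseline, assuming `outcome` takes only the values 0 and 1 (the Series is rebuilt from the reported class balance, not read from the file):

```python
import pandas as pd

# Outcome column with the class balance reported by df.describe():
# 3,133 claims out of 10,000 customers (mean 0.3133)
outcome = pd.Series([1.0] * 3133 + [0.0] * 6867)

# Accuracy of always predicting the most common class
baseline_accuracy = outcome.value_counts(normalize=True).max()
print(baseline_accuracy)
```

A feature whose accuracy equals this baseline probably never pushes the predicted probability above 0.5, so at that threshold it adds nothing over the majority-class guess.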

In [46]:
# Find the feature with the largest accuracy
best_feature = features[accuracies.index(max(accuracies))]

# Create best_feature_df
best_feature_df = pd.DataFrame({"best_feature": best_feature,
                                "best_accuracy": max(accuracies)},
                                index=[0])
best_feature_df

|   | best_feature | best_accuracy |
|---|---|---|
| 0 | driving_experience | 0.7771 |
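As an alternative to `accuracies.index(max(accuracies))`, the same lookup can be done with a labelled pandas Series, which also makes it easy to rank the runners-up. This is only a sketch: the feature names and accuracies are copied from the outputs above, and `scores` is an illustrative name.

```python
import pandas as pd

# Feature names and accuracies copied from the notebook outputs
features = ["age", "gender", "driving_experience", "education", "income",
            "credit_score", "vehicle_ownership", "vehicle_year", "married",
            "children", "postal_code", "annual_mileage", "vehicle_type",
            "speeding_violations", "duis", "past_accidents"]
accuracies = [0.7747, 0.6867, 0.7771, 0.6867, 0.7425, 0.7054, 0.7351, 0.6867,
              0.6867, 0.6867, 0.6867, 0.6933539754506248, 0.6867, 0.6867,
              0.6867, 0.6867]

# Label each accuracy with its feature and rank from best to worst
scores = pd.Series(accuracies, index=features).sort_values(ascending=False)

best_feature_df = pd.DataFrame({"best_feature": [scores.idxmax()],
                                "best_accuracy": [scores.max()]})
print(scores.head(3))
print(best_feature_df)
```

The ranking shows that `age` (0.7747) is a close second to `driving_experience` (0.7771), followed by `income` (0.7425).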
