Identify the single feature that best predicts whether a customer will make a claim (the "outcome" column). The "id" column is excluded as a candidate feature.
Store the result in a DataFrame called best_feature_df with two columns: "best_feature", the name of the feature with the highest accuracy, and "best_accuracy", its accuracy score.

How to approach the project
1. Reading in and exploring the dataset

2. Filling missing values

3. Preparing for modeling

4. Building and storing the models

5. Measuring performance

6. Finding the best performing model

1-Reading in and exploring the dataset
Create a pandas DataFrame and examine for data types, missing values, and distributions.




In [3]:
# Import required modules
import pandas as pd
import numpy as np
from statsmodels.formula.api import logit

# Load the dataset
car_insurance = pd.read_csv(r"C:\Users\Administrator\Desktop\Personal Projects\Modelling Car Insurance Claim Outcomes\car_insurance.csv")
print(car_insurance.head())
# info() prints its report directly and returns None, so it is not wrapped in print()
car_insurance.info()

       id  age  gender driving_experience    education         income  \
0  569520    3       0               0-9y  high school    upper class   
1  750365    0       1               0-9y         none        poverty   
2  199901    0       0               0-9y  high school  working class   
3  478866    0       1               0-9y   university  working class   
4  731664    1       1             10-19y         none  working class   

   credit_score  vehicle_ownership vehicle_year  married  children  \
0      0.629027                1.0   after 2015      0.0       1.0   
1      0.357757                0.0  before 2015      0.0       0.0   
2      0.493146                1.0  before 2015      0.0       0.0   
3      0.206013                1.0  before 2015      0.0       1.0   
4      0.388366                1.0  before 2015      0.0       0.0   

   postal_code  annual_mileage vehicle_type  speeding_violations  duis  \
0        10238         12000.0        sedan                    0  
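
The three checks named in this step (data types, missing values, distributions) can be sketched on a small stand-in frame. The frame `toy` and its values are made up for illustration; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for car_insurance (illustrative values only).
toy = pd.DataFrame({
    "credit_score": [0.63, np.nan, 0.49, 0.21],
    "income": ["upper class", "poverty", "working class", "working class"],
    "outcome": [0, 1, 0, 0],
})

# Data types of each column
print(toy.dtypes)

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Distribution of the numeric columns (count, mean, quartiles, ...)
numeric_summary = toy.describe()
print(numeric_summary)

# Distribution of a categorical column
income_counts = toy["income"].value_counts()
print(income_counts)
```

On the real data, the same calls would be run on `car_insurance` instead of `toy`.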

2-Filling missing values
Prepare data for modeling by ensuring there are no missing values.

In [4]:
car_insurance.isna().sum()
# Fill missing values in 'credit_score' with the median.
# Assigning the result back avoids the chained inplace call that pandas warns about.
car_insurance["credit_score"] = car_insurance["credit_score"].fillna(car_insurance["credit_score"].median())
# Fill missing values in 'annual_mileage' with the median
car_insurance["annual_mileage"] = car_insurance["annual_mileage"].fillna(car_insurance["annual_mileage"].median())
# Verify that there are no missing values
car_insurance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   10000 non-null  int64  
 1   age                  10000 non-null  int64  
 2   gender               10000 non-null  int64  
 3   driving_experience   10000 non-null  object 
 4   education            10000 non-null  object 
 5   income               10000 non-null  object 
 6   credit_score         10000 non-null  float64
 7   vehicle_ownership    10000 non-null  float64
 8   vehicle_year         10000 non-null  object 
 9   married              10000 non-null  float64
 10  children             10000 non-null  float64
 11  postal_code          10000 non-null  int64  
 12  annual_mileage       10000 non-null  float64
 13  vehicle_type         10000 non-null  object 
 14  speeding_violations  10000 non-null  int64  
 15  duis                 10000 non-null  

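
A minimal sketch of median imputation that does not rely on `inplace=True`, shown on a made-up frame (`toy` and its values are assumptions for illustration). Passing a Series of per-column medians to `DataFrame.fillna` fills each column with its own median in one call.

```python
import numpy as np
import pandas as pd

# Made-up stand-in with one gap in each numeric column.
toy = pd.DataFrame({
    "credit_score": [0.2, np.nan, 0.6, 0.4],
    "annual_mileage": [12000.0, 16000.0, np.nan, 11000.0],
})

# Per-column medians, indexed by column name
medians = toy[["credit_score", "annual_mileage"]].median()

# fillna matches the Series index against the column names
toy = toy.fillna(medians)

# Confirm that nothing is missing any more
print(toy.isna().sum())
```

Because the result is assigned back to `toy`, the frame itself is updated rather than a temporary copy.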


3-Preparing for modeling
Create variables for modeling and storing the results.

In [5]:
# Preparing for modeling
# Empty list to store the fitted models
models = []
# Candidate features: every column except "id" and "outcome"
features = car_insurance.drop(columns=["id", "outcome"]).columns


4-Building and storing the models
Build one model per feature and save the results to a list.

In [6]:
# Building and storing the models
# Fit one logistic regression per feature and save each fitted model to a list.
from statsmodels.formula.api import logit

# Loop through features
for col in features:
    # Create a model
    model = logit(f"outcome ~ {col}", data=car_insurance).fit()
    # Add each model to the models list
    models.append(model)

Optimization terminated successfully.
         Current function value: 0.511794
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.615951
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.467092
         Iterations 8
Optimization terminated successfully.
         Current function value: 0.603742
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.531499
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.572649
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.552412
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.572668
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.586659
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.595431
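
The single-feature fit can be tried in isolation with the sketch below. The frame `toy`, the column `violations`, and the simulated relationship are assumptions made for illustration, not part of the dataset. Passing `disp=False` to `fit()` suppresses the "Optimization terminated successfully" messages shown above.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import logit

rng = np.random.default_rng(0)
n = 500

# Simulated data: the claim probability rises with the number of violations.
violations = rng.integers(0, 5, size=n)
claim_prob = 1 / (1 + np.exp(-(violations - 2)))
toy = pd.DataFrame({
    "violations": violations,
    "outcome": (rng.random(n) < claim_prob).astype(int),
})

# Fit a logistic regression with one explanatory variable, without console output
model = logit("outcome ~ violations", data=toy).fit(disp=False)

# Fitted intercept and slope
print(model.params)

# 2x2 table of observed (rows) versus predicted (columns) outcomes
table = model.pred_table()
print(table)
```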
  

5-Measuring performance
Calculate the accuracy of each model.

In [7]:
# Empty list to store accuracies
accuracies = []

# Loop through the fitted models (same order as features)
for model in models:
    # Confusion matrix: rows are observed outcomes, columns are predicted outcomes
    conf_matrix = model.pred_table()
    # True negatives
    tn = conf_matrix[0, 0]
    # True positives
    tp = conf_matrix[1, 1]
    # False negatives
    fn = conf_matrix[1, 0]
    # False positives
    fp = conf_matrix[0, 1]
    # Accuracy: share of correct predictions among all predictions
    acc = (tn + tp) / (tn + fn + fp + tp)
    accuracies.append(acc)
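
To make the table layout concrete, the sketch below rebuilds a 2x2 table by hand with observed outcomes as rows and predicted outcomes as columns, then computes accuracy from it. The helper `confusion_table`, the 0.5 cut-off used here, and the eight observations are assumptions for illustration.

```python
import numpy as np

def confusion_table(observed, predicted_prob, threshold=0.5):
    """Return a 2x2 table: rows are observed classes, columns are predicted classes."""
    observed = np.asarray(observed, dtype=int)
    predicted = (np.asarray(predicted_prob) > threshold).astype(int)
    table = np.zeros((2, 2))
    for obs, pred in zip(observed, predicted):
        table[obs, pred] += 1
    return table

# Made-up observations and predicted probabilities
observed = [0, 0, 0, 1, 1, 0, 1, 0]
predicted_prob = [0.1, 0.4, 0.7, 0.8, 0.3, 0.2, 0.9, 0.6]

table = confusion_table(observed, predicted_prob)

# Row-major order of the table: [[tn, fp], [fn, tp]]
tn, fp, fn, tp = table.ravel()
accuracy = (tn + tp) / table.sum()
print(table)
print(accuracy)  # 0.625
```

Here 3 true negatives and 2 true positives out of 8 observations give an accuracy of 5/8.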

6-Finding the best performing model
Locate which model has the highest accuracy score.

In [8]:
# Find the feature with the largest accuracy
best_feature = features[accuracies.index(max(accuracies))]

# Create best_feature_df
best_feature_df = pd.DataFrame({"best_feature": best_feature,
                                "best_accuracy": max(accuracies)},
                                index=[0])
print(best_feature_df)

         best_feature  best_accuracy
0  driving_experience         0.7771
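
An alternative sketch of the lookup step: if the accuracies are held in a pandas Series indexed by feature name, `idxmax()` returns the best feature directly. The feature names and scores below are hypothetical placeholders, not results from the dataset.

```python
import pandas as pd

# Hypothetical accuracies, for illustration only
accuracy_by_feature = pd.Series({
    "feature_a": 0.61,
    "feature_b": 0.74,
    "feature_c": 0.68,
})

# idxmax() gives the index label of the largest value, max() the value itself
best = pd.DataFrame({
    "best_feature": [accuracy_by_feature.idxmax()],
    "best_accuracy": [accuracy_by_feature.max()],
})
print(best)
```

With the notebook's variables, such a Series could be built as `pd.Series(accuracies, index=features)`.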
