![car](car.jpg)

Insurance companies invest a lot of [time and money](https://www.accenture.com/_acnmedia/pdf-84/accenture-machine-leaning-insurance.pdf) into optimizing their pricing and accurately estimating the likelihood that customers will make a claim. In many countries it is a legal requirement to have car insurance in order to drive a vehicle on public roads, so the market is very large!

Knowing all of this, On the Road car insurance have requested your services in building a model to predict whether a customer will make a claim on their insurance during the policy period. As they have very little expertise and infrastructure for deploying and monitoring machine learning models, they've asked you to identify the single feature that results in the best performing model, as measured by accuracy, so they can start with a simple model in production.

They have supplied you with their customer data as a csv file called `car_insurance.csv`, along with a table detailing the column names and descriptions below.



## The dataset

| Column | Description |
|--------|-------------|
| `id` | Unique client identifier |
| `age` | Client's age: <br> <ul><li>`0`: 16-25</li><li>`1`: 26-39</li><li>`2`: 40-64</li><li>`3`: 65+</li></ul> |
| `gender` | Client's gender: <br> <ul><li>`0`: Female</li><li>`1`: Male</li></ul> |
| `driving_experience` | Years the client has been driving: <br> <ul><li>`0`: 0-9</li><li>`1`: 10-19</li><li>`2`: 20-29</li><li>`3`: 30+</li></ul> |
| `education` | Client's level of education: <br> <ul><li>`0`: No education</li><li>`1`: High school</li><li>`2`: University</li></ul> |
| `income` | Client's income level: <br> <ul><li>`0`: Poverty</li><li>`1`: Working class</li><li>`2`: Middle class</li><li>`3`: Upper class</li></ul> |
| `credit_score` | Client's credit score (between zero and one) |
| `vehicle_ownership` | Client's vehicle ownership status: <br><ul><li>`0`: Does not own their vehicle (paying off finance)</li><li>`1`: Owns their vehicle</li></ul> |
| `vehicle_year` | Year of vehicle registration: <br><ul><li>`0`: Before 2015</li><li>`1`: 2015 or later</li></ul> |
| `married` | Client's marital status: <br><ul><li>`0`: Not married</li><li>`1`: Married</li></ul> |
| `children` | Client's number of children |
| `postal_code` | Client's postal code | 
| `annual_mileage` | Number of miles driven by the client each year |
| `vehicle_type` | Type of car: <br> <ul><li>`0`: Sedan</li><li>`1`: Sports car</li></ul> |
| `speeding_violations` | Total number of speeding violations received by the client | 
| `duis` | Number of times the client has been caught driving under the influence of alcohol |
| `past_accidents` | Total number of previous accidents the client has been involved in |
| `outcome` | Whether the client made a claim on their car insurance (response variable): <br><ul><li>`0`: No claim</li><li>`1`: Made a claim</li></ul> |

For this project we want to:

- Identify the single feature of the data that is the best predictor of whether a customer will put in a claim (the "outcome" column), excluding the "id" column.
- Store the result as a DataFrame called `best_feature_df` with two columns: "best_feature", holding the name of the feature with the highest accuracy, and "best_accuracy", holding the corresponding accuracy score.

The first step is to import the libraries needed for modeling, along with pandas and NumPy for data manipulation.

In [9]:
# Import required modules
import pandas as pd
import numpy as np
from statsmodels.formula.api import logit
from sklearn.metrics import accuracy_score


Before tackling the question directly, a correlation matrix gives a quick first look at how each variable relates to the outcome and to the other variables.

Because the task is to find the single feature (excluding "id") that best predicts whether a customer will make a claim, the `outcome` column of that matrix is the part to focus on. Note that `corr(numeric_only=True)` skips the text-typed columns (`driving_experience`, `education`, `income`, `vehicle_year`, `vehicle_type`), so this view is only a starting point.

In [10]:
# Create DataFrame from .csv, Fill NaN values with the mean of each column where applicable
df = pd.read_csv('car_insurance.csv')

# Check for missing values
df.info()

# Fill missing values with the column mean (assign back rather than using inplace on a column selection)
df["credit_score"] = df["credit_score"].fillna(df["credit_score"].mean())
df["annual_mileage"] = df["annual_mileage"].fillna(df["annual_mileage"].mean())

# Correlation between all numeric columns; the 'outcome' column is the one of interest
correlation_matrix = df.corr(numeric_only=True)

# Display the full correlation matrix with a color gradient
correlation_matrix.style.background_gradient(cmap='coolwarm')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   10000 non-null  int64  
 1   age                  10000 non-null  int64  
 2   gender               10000 non-null  int64  
 3   driving_experience   10000 non-null  object 
 4   education            10000 non-null  object 
 5   income               10000 non-null  object 
 6   credit_score         9018 non-null   float64
 7   vehicle_ownership    10000 non-null  float64
 8   vehicle_year         10000 non-null  object 
 9   married              10000 non-null  float64
 10  children             10000 non-null  float64
 11  postal_code          10000 non-null  int64  
 12  annual_mileage       9043 non-null   float64
 13  vehicle_type         10000 non-null  object 
 14  speeding_violations  10000 non-null  int64  
 15  duis                 10000 non-null  

| | id | age | gender | credit_score | vehicle_ownership | married | children | postal_code | annual_mileage | speeding_violations | duis | past_accidents | outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | 1.0 | 0.013512 | -0.007343 | 0.001621 | 0.009197 | 0.014826 | 0.001233 | 0.006038 | -0.002111 | 0.008156 | 0.009268 | 0.001831 | -0.010506 |
| age | 0.013512 | 1.0 | 0.005929 | 0.471419 | 0.27214 | 0.384759 | 0.383708 | 0.008553 | -0.263838 | 0.458413 | 0.281937 | 0.431061 | -0.448463 |
| gender | -0.007343 | 0.005929 | 1.0 | -0.077478 | 0.007385 | 0.008393 | -0.00264 | -0.001996 | -0.015068 | 0.202095 | 0.094202 | 0.223202 | 0.107208 |
| credit_score | 0.001621 | 0.471419 | -0.077478 | 1.0 | 0.295689 | 0.267074 | 0.209515 | 0.008533 | -0.157641 | 0.194645 | 0.120953 | 0.172077 | -0.30901 |
| vehicle_ownership | 0.009197 | 0.27214 | 0.007385 | 0.295689 | 1.0 | 0.175626 | 0.12599 | -0.004866 | -0.092701 | 0.133868 | 0.086567 | 0.119521 | -0.378921 |
| married | 0.014826 | 0.384759 | 0.008393 | 0.267074 | 0.175626 | 1.0 | 0.287009 | 0.012045 | -0.43952 | 0.218855 | 0.12084 | 0.215269 | -0.262104 |
| children | 0.001233 | 0.383708 | -0.00264 | 0.209515 | 0.12599 | 0.287009 | 1.0 | 0.020911 | -0.425813 | 0.220415 | 0.115354 | 0.206295 | -0.232835 |
| postal_code | 0.006038 | 0.008553 | -0.001996 | 0.008533 | -0.004866 | 0.012045 | 0.020911 | 1.0 | -0.127286 | 0.113686 | 0.038492 | -0.116985 | 0.095889 |
| annual_mileage | -0.002111 | -0.263838 | -0.015068 | -0.157641 | -0.092701 | -0.43952 | -0.425813 | -0.127286 | 1.0 | -0.308125 | -0.111232 | -0.18718 | 0.177575 |
| speeding_violations | 0.008156 | 0.458413 | 0.202095 | 0.194645 | 0.133868 | 0.218855 | 0.220415 | 0.113686 | -0.308125 | 1.0 | 0.359838 | 0.443074 | -0.291862 |
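
The full matrix is hard to scan when only one column matters. As a minimal sketch, assuming a small made-up DataFrame (`toy` is hypothetical and does not contain rows from `car_insurance.csv`), the `outcome` column can be pulled out and ranked by absolute value:

```python
import pandas as pd

# Made-up toy data for illustration only; NOT rows from car_insurance.csv
toy = pd.DataFrame({
    "age":               [0, 0, 1, 1, 2, 2, 3, 3],
    "vehicle_ownership": [0, 1, 0, 1, 1, 1, 0, 1],
    "annual_mileage":    [15000, 14000, 12000, 13000, 11000, 9000, 10000, 8000],
    "outcome":           [1, 1, 1, 0, 0, 0, 1, 0],
})

# Correlation of every column with 'outcome', minus the trivial self-correlation
corr_with_outcome = toy.corr()["outcome"].drop("outcome")

# Rank by absolute value: a strong negative correlation matters as much as a positive one
order = corr_with_outcome.abs().sort_values(ascending=False).index
ranked = corr_with_outcome.reindex(order)
print(ranked)
```

On the real data the same two steps would be applied to `correlation_matrix`, which was already built with `numeric_only=True`.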


Once the best predictor is identified, it is worth understanding how well it actually separates claims from non-claims. A bar plot or a confusion matrix shows how well a single feature predicts the outcome and helps interpret the model's accuracy. Among the numeric columns shown above, age has the strongest correlation with the outcome (about -0.45, so older age groups claim less often). It is therefore worth a closer look, along with a non-numeric variable such as driving experience, which is closely tied to the age of the insured person.
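
One simple way to inspect a categorical feature is the claim rate per category. The sketch below assumes a made-up `toy` DataFrame; the category labels are illustrative strings and may not match the exact values stored in `car_insurance.csv`:

```python
import pandas as pd

# Made-up toy data for illustration only; labels are assumed, not taken from the CSV
toy = pd.DataFrame({
    "driving_experience": ["0-9", "0-9", "0-9", "10-19", "10-19", "20-29", "20-29", "30+"],
    "outcome":            [1, 1, 0, 1, 0, 0, 0, 0],
})

# The mean of a 0/1 column is the share of clients in each group who made a claim
claim_rate = toy.groupby("driving_experience")["outcome"].mean()
print(claim_rate)
```

With matplotlib installed, `claim_rate.plot(kind="bar")` would turn this into the bar plot mentioned above.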

In [12]:
features = df.drop(columns=["id", "outcome"]).columns

# List to store the fitted models
model_list = []

# Fit one single-feature logistic regression per column
for col in features:
    # Create a model
    model = logit(f"outcome ~ {col}", data=df).fit()
    # Add each model to the models list
    model_list.append(model)

# Empty list to store accuracies
accuracies = []

# Loop through models
for model in model_list:
    # Confusion matrix: rows are observed classes, columns are predicted classes
    conf_matrix = model.pred_table()
    # True negatives
    tn = conf_matrix[0, 0]
    # True positives
    tp = conf_matrix[1, 1]
    # False negatives
    fn = conf_matrix[1, 0]
    # False positives
    fp = conf_matrix[0, 1]
    # Compute accuracy
    acc = (tn + tp) / (tn + fn + fp + tp)
    accuracies.append(acc)
    
# Find the feature with the largest accuracy
best_feature = features[accuracies.index(max(accuracies))]

# Create best_feature_df
best_feature_df = pd.DataFrame({"best_feature": best_feature,
                                "best_accuracy": max(accuracies)},
                                index=[0])
print(best_feature_df)

Optimization terminated successfully.
         Current function value: 0.511794
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.615951
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.467092
         Iterations 8
Optimization terminated successfully.
         Current function value: 0.603742
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.531499
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.572557
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.552412
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.572668
         Iterations 6
Optimization terminated successfully.
         Current function value: 0.586659
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.595431
  

The code above fits a separate logistic regression for each feature in the dataset to predict the outcome. It then computes the accuracy of each model from its confusion matrix and picks the feature with the highest accuracy. In this case, "driving_experience" appears to be the feature with the best performance, followed by "age."
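
The accuracy step comes down to adding up the diagonal of the confusion matrix and dividing by the total count. A minimal sketch with made-up counts, assuming the layout that statsmodels documents for `pred_table()` (rows are observed classes, columns are predicted classes):

```python
import numpy as np

# Made-up counts for illustration only; rows = observed (0, 1), columns = predicted (0, 1)
conf_matrix = np.array([[60.0, 10.0],
                        [12.0, 18.0]])

tn, fp = conf_matrix[0]  # observed 0: predicted 0, predicted 1
fn, tp = conf_matrix[1]  # observed 1: predicted 0, predicted 1

# Share of all predictions that were correct
accuracy = (tn + tp) / (tn + fp + fn + tp)

# Equivalent shortcut: correct predictions sit on the diagonal
accuracy_via_trace = np.trace(conf_matrix) / conf_matrix.sum()

print(accuracy, accuracy_via_trace)
```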

**Driving Experience and Age Significance:**

- The fact that "driving_experience" is identified as the best feature suggests that it is a significant predictor of whether a customer will make a claim. Similarly, "age" is identified as the second-best feature, indicating that it also plays a crucial role in predicting the outcome.

**Interpretation of Logistic Regression:**

- The coefficients obtained from logistic regression can provide insights into the direction and strength of the relationship between each feature and the outcome. You might consider examining the coefficients of "driving_experience" and "age" to understand their impact.

**Visualization of the Logistic Regression Model:**

- Visualizing the logistic regression curves for "driving_experience" and "age" can help you understand how the probability of making a claim changes with these features. You can use a plot to show the logistic curves for these two features.
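
To make the link between a coefficient and the predicted probability concrete, the sketch below evaluates the logistic function for a one-feature model. The intercept and slope are made-up values, not coefficients fitted on this dataset; with a fitted statsmodels model they would come from `model.params`. The helper `claim_probability` is a hypothetical name introduced only for this example.

```python
import numpy as np

def claim_probability(x, intercept, slope):
    """Logistic function: P(outcome = 1 | x) for a one-feature model."""
    return 1.0 / (1.0 + np.exp(-(intercept + slope * x)))

# Hypothetical coefficients, chosen only for illustration
intercept, slope = 0.5, -0.8

# Encoded age groups from the dataset table (0 to 3)
age_codes = np.array([0, 1, 2, 3])
probs = claim_probability(age_codes, intercept, slope)

# exp(slope) is the multiplicative change in the odds for a one-unit increase in x
odds_ratio = np.exp(slope)

for code, p in zip(age_codes, probs):
    print(f"age code {code}: P(claim) = {p:.3f}")
print(f"odds ratio per step: {odds_ratio:.3f}")
```

A negative slope means the predicted probability falls as the feature increases. Evaluating `claim_probability` on a finer grid of values and plotting the result gives the S-shaped logistic curve.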

**Further Investigation:**

Consider exploring potential interactions between features. Interaction terms in logistic regression can capture combined effects that might be missed when considering individual features.
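
As a rough sketch of what an interaction term looks like in the statsmodels formula API, the example below fits a model on synthetic data. The columns `x1`, `x2`, and `y` are hypothetical stand-ins, and the data is randomly generated, not taken from the car insurance file:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import logit

rng = np.random.default_rng(0)
n = 2000

# Synthetic data for illustration only; NOT the car insurance data
toy = pd.DataFrame({
    "x1": rng.integers(0, 2, size=n),
    "x2": rng.integers(0, 2, size=n),
})
# The simulated effect of x1 is larger when x2 is 1, which is what an interaction captures
log_odds = -1.0 + 0.5 * toy["x1"] + 0.5 * toy["x2"] + 1.0 * toy["x1"] * toy["x2"]
toy["y"] = rng.binomial(1, 1.0 / (1.0 + np.exp(-log_odds)))

# 'x1 * x2' expands to x1 + x2 + x1:x2 (both main effects plus their interaction)
model = logit("y ~ x1 * x2", data=toy).fit(disp=0)
print(model.params)

# Accuracy from the confusion matrix, as in the single-feature loop
conf_matrix = model.pred_table()
accuracy = np.trace(conf_matrix) / conf_matrix.sum()
print(f"accuracy: {accuracy:.3f}")
```

Comparing this accuracy with that of the corresponding single-feature models indicates whether the interaction adds anything, although a model with interactions is no longer the single-feature model the client asked for.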