![Parked car](car.jpg)

Insurance companies invest a lot of [time and money](https://www.accenture.com/_acnmedia/pdf-84/accenture-machine-leaning-insurance.pdf) into optimizing their pricing and accurately estimating the likelihood that customers will make a claim. In many countries, car insurance is a legal requirement for driving a vehicle on public roads, so the market is very large!

Knowing all of this, On the Road car insurance has requested your services in building a model to predict whether a customer will make a claim on their insurance during the policy period. As they have very little expertise and infrastructure for deploying and monitoring machine learning models, they've asked you to use simple Logistic Regression, identifying the single feature that results in the best-performing model, as measured by accuracy.

They have supplied you with their customer data as a csv file called `car_insurance.csv`, along with a table (below) detailing the column names and descriptions.

## The dataset

| Column | Description |
|--------|-------------|
| `id` | Unique client identifier |
| `age` | Client's age: <br> <ul><li>`0`: 16-25</li><li>`1`: 26-39</li><li>`2`: 40-64</li><li>`3`: 65+</li></ul> |
| `gender` | Client's gender: <br> <ul><li>`0`: Female</li><li>`1`: Male</li></ul> |
| `driving_experience` | Years the client has been driving: <br> <ul><li>`0`: 0-9</li><li>`1`: 10-19</li><li>`2`: 20-29</li><li>`3`: 30+</li></ul> |
| `education` | Client's level of education: <br> <ul><li>`0`: No education</li><li>`1`: High school</li><li>`2`: University</li></ul> |
| `income` | Client's income level: <br> <ul><li>`0`: Poverty</li><li>`1`: Working class</li><li>`2`: Middle class</li><li>`3`: Upper class</li></ul> |
| `credit_score` | Client's credit score (between zero and one) |
| `vehicle_ownership` | Client's vehicle ownership status: <br><ul><li>`0`: Does not own their vehicle (paying off finance)</li><li>`1`: Owns their vehicle</li></ul> |
| `vehicle_year` | Year of vehicle registration: <br><ul><li>`0`: Before 2015</li><li>`1`: 2015 or later</li></ul> |
| `married` | Client's marital status: <br><ul><li>`0`: Not married</li><li>`1`: Married</li></ul> |
| `children` | Client's number of children |
| `postal_code` | Client's postal code | 
| `annual_mileage` | Number of miles driven by the client each year |
| `vehicle_type` | Type of car: <br> <ul><li>`0`: Sedan</li><li>`1`: Sports car</li></ul> |
| `speeding_violations` | Total number of speeding violations received by the client | 
| `duis` | Number of times the client has been caught driving under the influence of alcohol |
| `past_accidents` | Total number of previous accidents the client has been involved in |
| `outcome` | Whether the client made a claim on their car insurance (response variable): <br><ul><li>`0`: No claim</li><li>`1`: Made a claim</li></ul> |
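
The integer codes in the table can be turned into labelled factors, which tends to make model output easier to read. The block below is a minimal sketch of that idea for `driving_experience`. The input vector is made up for illustration and is not taken from `car_insurance.csv`; only the labels are copied from the table.

```r
# Hypothetical sample of coded values (illustrative, not from the real file)
driving_experience_codes <- c(0, 2, 1, 3, 0)

# Labels copied from the dataset table
experience_labels <- c("0-9", "10-19", "20-29", "30+")

# Map each integer code to its label, keeping the natural ordering of the levels
driving_experience <- factor(
  driving_experience_codes,
  levels = 0:3,
  labels = experience_labels
)

# Count how many clients fall into each labelled band
print(table(driving_experience))
```

Listing all four levels explicitly keeps a band in the factor even when no row in the sample uses it.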

In [76]:
# Import required libraries
library(readr)
library(dplyr)
library(glue)
library(yardstick)
library(caret)

# Start coding!

In [77]:
df <- read.csv("car_insurance.csv")
df

id,age,gender,race,driving_experience,education,income,credit_score,vehicle_ownership,vehicle_year,married,children,postal_code,annual_mileage,vehicle_type,speeding_violations,duis,past_accidents,outcome
<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<int>,<dbl>,<int>,<int>,<int>,<int>,<dbl>
569520,3,0,1,0,2,3,0.6290273,1,1,0,1,10238,12000,0,0,0,0,0
750365,0,1,1,0,0,0,0.3577571,0,0,0,0,10238,16000,0,0,0,0,1
199901,0,0,1,0,2,1,0.4931458,1,0,0,0,10238,11000,0,0,0,0,0
478866,0,1,1,0,3,1,0.2060129,1,0,0,1,32765,11000,0,0,0,0,0
731664,1,1,1,1,0,1,0.3883659,1,0,0,0,32765,12000,0,2,0,1,1
877557,2,0,1,2,2,3,0.6191274,1,1,0,1,10238,13000,0,3,0,3,0
930134,3,1,1,3,2,3,0.4929436,0,1,1,1,10238,13000,0,7,0,3,0
461006,1,0,1,0,3,1,0.4686893,0,1,0,1,10238,14000,0,0,0,0,1
68366,2,0,1,2,3,1,0.5218149,0,0,1,0,10238,13000,0,0,0,0,0
445911,2,0,1,0,2,3,0.5615310,1,0,0,1,32765,11000,0,0,0,0,1


In [78]:
str(df)

'data.frame':	10000 obs. of  19 variables:
 $ id                 : int  569520 750365 199901 478866 731664 877557 930134 461006 68366 445911 ...
 $ age                : int  3 0 0 0 1 2 3 1 2 2 ...
 $ gender             : int  0 1 0 1 1 0 1 0 0 0 ...
 $ race               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ driving_experience : int  0 0 0 0 1 2 3 0 2 0 ...
 $ education          : int  2 0 2 3 0 2 2 3 3 2 ...
 $ income             : int  3 0 1 1 1 3 3 1 1 3 ...
 $ credit_score       : num  0.629 0.358 0.493 0.206 0.388 ...
 $ vehicle_ownership  : num  1 0 1 1 1 1 0 0 0 1 ...
 $ vehicle_year       : int  1 0 0 0 0 1 1 1 0 0 ...
 $ married            : num  0 0 0 0 0 0 1 0 1 0 ...
 $ children           : num  1 0 0 1 0 1 1 1 0 1 ...
 $ postal_code        : int  10238 10238 10238 32765 32765 10238 10238 10238 10238 32765 ...
 $ annual_mileage     : num  12000 16000 11000 11000 12000 13000 13000 14000 13000 11000 ...
 $ vehicle_type       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ speeding_violations:

In [79]:
df <- df %>% select(-id)

In [80]:
age_labels <- c("16-25", "26-39", "40-64", "65+")
df$age <- factor(df$age, levels = 0:3, labels = age_labels)
columns_to_factorize <- c("gender", "driving_experience", 
                          "education", "income", "vehicle_ownership", 
                          "vehicle_year", "married", "postal_code", "vehicle_type", "outcome","race")
# Use mutate_if to convert the listed columns to categorical (factor) variables
df <- df %>%
  mutate_if(names(.) %in% columns_to_factorize, as.factor)

In [81]:
# Verify the change
str(df)

'data.frame':	10000 obs. of  18 variables:
 $ age                : Factor w/ 4 levels "16-25","26-39",..: 4 1 1 1 2 3 4 2 3 3 ...
 $ gender             : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 2 1 1 1 ...
 $ race               : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ driving_experience : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 2 3 4 1 3 1 ...
 $ education          : Factor w/ 3 levels "0","2","3": 2 1 2 3 1 2 2 3 3 2 ...
 $ income             : Factor w/ 4 levels "0","1","2","3": 4 1 2 2 2 4 4 2 2 4 ...
 $ credit_score       : num  0.629 0.358 0.493 0.206 0.388 ...
 $ vehicle_ownership  : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 1 1 1 2 ...
 $ vehicle_year       : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 2 2 1 1 ...
 $ married            : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
 $ children           : num  1 0 0 1 0 1 1 1 0 1 ...
 $ postal_code        : Factor w/ 4 levels "10238","21217",..: 1 1 1 3 3 1 1 1 1 3 ...
 $ annual_mileage     : num  12000 1600

In [82]:
colSums(is.na(df))

In [83]:
impute_with_mean <- function(x) {
  mean_value <- mean(x, na.rm = TRUE)  # Calculate mean, ignoring NA values
  x_imputed <- ifelse(is.na(x), mean_value, x)  # Impute missing values with mean
  return(x_imputed)
}

In [84]:
df$credit_score <- impute_with_mean(df$credit_score)
df$annual_mileage <- impute_with_mean(df$annual_mileage)
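
To see what the helper does, here is a small check on a made-up vector. The values are illustrative and not from the dataset, and `impute_with_mean` is redefined in compact form only so that the block runs on its own.

```r
# Same logic as the notebook's helper, repeated so this block is self-contained
impute_with_mean <- function(x) {
  ifelse(is.na(x), mean(x, na.rm = TRUE), x)
}

# Hypothetical credit scores with two missing entries
toy_scores <- c(0.2, NA, 0.6, 0.4, NA)

imputed <- impute_with_mean(toy_scores)
print(imputed)

# The column mean is unchanged by mean imputation
print(mean(toy_scores, na.rm = TRUE))
print(mean(imputed))

# The spread shrinks, because the filled-in values sit exactly at the mean
print(sd(toy_scores, na.rm = TRUE))
print(sd(imputed))
```

Mean imputation keeps the column average intact but reduces its variance, which is worth bearing in mind when a column has many missing values.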

In [85]:
# Candidate features: every column except the response variable
cars <- df %>% select(-outcome)

In [86]:
# Create an empty data frame to fill in the for loop below
features_df <- data.frame(feature = character(), accuracy = numeric())

In [87]:
# Fit a logistic regression model on each column to compare the results
features_df <- data.frame() # Create an empty data frame to store the results
for (col in names(cars)) {
  formula <- as.formula(paste("outcome ~", col))
  # Generalized linear model (GLM) with a binomial family, i.e. logistic regression
  model <- glm(formula, data = df, family = binomial)
  predictions <- predict(model, newdata = cars, type = "response")
  # Classify as 1 if the predicted probability is greater than 0.5, otherwise 0
  predicted_classes <- ifelse(predictions > 0.5, 1, 0)
  # Accuracy: the number of correct predictions (TRUE values in the comparison) divided by the number of rows
  accuracy <- sum(predicted_classes == df$outcome) / nrow(df)
  features_df <- rbind(features_df, data.frame(feature = col, accuracy = accuracy))
}
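
The same per-feature search can also be written as a small function applied over the feature names. The block below is a sketch of that variant, not the notebook's own code. It runs on simulated data, and the names `toy`, `x_signal`, `x_noise` and `single_feature_accuracy` are hypothetical.

```r
set.seed(42)

# Simulated data: the outcome depends on x_signal only, x_noise is unrelated
n <- 200
toy <- data.frame(x_signal = rnorm(n), x_noise = rnorm(n))
toy$outcome <- factor(rbinom(n, 1, plogis(2 * toy$x_signal)))

# Fit outcome ~ feature and return the training accuracy at a 0.5 threshold
single_feature_accuracy <- function(feature, data, target = "outcome") {
  model <- glm(reformulate(feature, response = target),
               data = data, family = binomial)
  predicted <- ifelse(predict(model, type = "response") > 0.5, 1, 0)
  observed <- as.integer(as.character(data[[target]]))
  mean(predicted == observed)
}

# Evaluate every column except the response
features <- setdiff(names(toy), "outcome")
accuracies <- vapply(features, single_feature_accuracy, numeric(1), data = toy)

print(sort(accuracies, decreasing = TRUE))
```

As in the loop above, accuracy here is measured on the same rows the model was fitted on, so it is a training accuracy rather than an estimate on unseen data.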

In [88]:
print(features_df)

               feature accuracy
1                  age   0.7747
2               gender   0.6867
3                 race   0.6867
4   driving_experience   0.7771
5            education   0.6867
6               income   0.7425
7         credit_score   0.7054
8    vehicle_ownership   0.7351
9         vehicle_year   0.6867
10             married   0.6867
11            children   0.6867
12         postal_code   0.6987
13      annual_mileage   0.6904
14        vehicle_type   0.6867
15 speeding_violations   0.6867
16                duis   0.6867
17      past_accidents   0.6867
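
Many features share the value 0.6867. This matches the share of clients who made no claim (6,867 of 10,000, the counts that appear in the confusion matrix later in the notebook), which is what a model scores when its predicted probability never exceeds 0.5 and it therefore predicts `0` for everyone. The sketch below copies the accuracies from the output above and keeps only the features that beat that baseline. The variable names are illustrative.

```r
# Class counts as reported in the notebook's confusion matrix
n_no_claim <- 6867
n_claim <- 3133

# Accuracy of always predicting "no claim"
baseline_accuracy <- n_no_claim / (n_no_claim + n_claim)
print(baseline_accuracy)

# Accuracies copied from the printed features_df
feature_accuracy <- c(
  age = 0.7747, gender = 0.6867, race = 0.6867, driving_experience = 0.7771,
  education = 0.6867, income = 0.7425, credit_score = 0.7054,
  vehicle_ownership = 0.7351, vehicle_year = 0.6867, married = 0.6867,
  children = 0.6867, postal_code = 0.6987, annual_mileage = 0.6904,
  vehicle_type = 0.6867, speeding_violations = 0.6867, duis = 0.6867,
  past_accidents = 0.6867
)

# Keep features that improve on the baseline (small tolerance for rounding)
above_baseline <- feature_accuracy[feature_accuracy - baseline_accuracy > 1e-9]
print(sort(above_baseline, decreasing = TRUE))
```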


In [89]:
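# Note: predicted_classes still holds the predictions from the last loop iteration (past_accidents)
conf_matrix <- confusionMatrix(as.factor(predicted_classes), as.factor(df$outcome))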
print(conf_matrix)

“Levels are not in the same order for reference and data. Refactoring data to match.”


Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 6867 3133
         1    0    0
                                          
               Accuracy : 0.6867          
                 95% CI : (0.6775, 0.6958)
    No Information Rate : 0.6867          
    P-Value [Acc > NIR] : 0.5048          
                                          
                  Kappa : 0               
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 1.0000          
            Specificity : 0.0000          
         Pos Pred Value : 0.6867          
         Neg Pred Value :    NaN          
             Prevalence : 0.6867          
         Detection Rate : 0.6867          
   Detection Prevalence : 1.0000          
      Balanced Accuracy : 0.5000          
                                          
       'Positive' Class : 0               
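
The statistics above follow directly from the four cell counts. As a worked check, the sketch below recomputes them by hand, treating `0` as the positive class as the output states. The variable names are illustrative.

```r
# Cell counts copied from the confusion matrix (positive class = 0)
true_pos  <- 6867  # reference 0, predicted 0
false_neg <- 0     # reference 0, predicted 1
false_pos <- 3133  # reference 1, predicted 0
true_neg  <- 0     # reference 1, predicted 1

# Share of actual positives that were predicted positive
sensitivity <- true_pos / (true_pos + false_neg)

# Share of actual negatives that were predicted negative
specificity <- true_neg / (true_neg + false_pos)

# Share of positive predictions that were correct
pos_pred_value <- true_pos / (true_pos + false_pos)

# No negative predictions were made, so this is 0 / 0
neg_pred_value <- true_neg / (true_neg + false_neg)

balanced_accuracy <- (sensitivity + specificity) / 2

print(c(sensitivity = sensitivity, specificity = specificity,
        pos_pred_value = pos_pred_value, neg_pred_value = neg_pred_value,
        balanced_accuracy = balanced_accuracy))
```

A balanced accuracy of 0.5 together with a Kappa of 0 indicates that these predictions carry no information beyond always guessing the majority class.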
                        

In [90]:
best_feature_row <- features_df[which.max(features_df$accuracy), ]
best_feature <- best_feature_row$feature
best_accuracy <- best_feature_row$accuracy
best_feature_df <- data.frame(best_feature, best_accuracy)

In [91]:
print(best_feature_df)

        best_feature best_accuracy
1 driving_experience        0.7771
