# Car Evaluation Analysis

## Summary

We develop a Random Forest classification model that predicts car acceptability categories: unacceptable (unacc), acceptable (acc), good (good), and very good (vgood). The prediction is based on categorical attributes such as price, maintenance cost, safety, and seating capacity. The data come from the Car Evaluation dataset in the UCI Machine Learning Repository (Bohanec, 1988).

Our Random Forest model achieved an overall accuracy of 94.68% and a Kappa statistic of 0.8822 on the held-out test set, indicating strong agreement between predicted and actual classes beyond what chance alone would produce. However, performance was heavily influenced by class imbalance: the confusion matrix shows sensitivity of roughly 0.99 or higher for the common classes (unacc, acc) but only about 0.54 for the rare classes (good, vgood). To improve the model, we recommend addressing the imbalance, for example through class weighting. Overall, the model provides an accurate approach to car acceptability classification, but further refinement could make performance more even across categories. This analysis demonstrates how machine learning can be applied to decision-making in the automotive sector.

## Introduction

Evaluating car acceptability is a critical factor in decision-making within the automotive industry. It affects consumer choices regarding vehicle purchases, manufacturers' priorities, and dealership strategies. Car purchases represent one of the most significant financial decisions for households, with affordability being the primary barrier for many buyers. Price plays a crucial role in accessibility, particularly for budget-conscious consumers, such as first-time buyers or individuals in emerging markets.

According to Chiu et al. (2022), price dispersion positively impacts car acceptability. Because car purchases are high-cost, long-term investments, misaligned choices due to information asymmetry (such as undervaluing safety features) can lead to serious financial or safety repercussions (Financial Consumer Agency of Canada, 2024). For instance, vehicles with poor safety ratings have been linked to higher accident rates. Consistent with this, Vrkljan and Anaby (2011) found that safety is considered the most important feature when purchasing a vehicle. Automating car evaluations helps buyers efficiently identify suitable vehicles, in line with the trend toward data-driven consumer tools. Recognizing critical features, such as safety and price, also reflects the industry's priorities in vehicle design.

This project aims to develop a classification model to predict car acceptability based on various features such as price, maintenance cost, safety, and seating capacity. This analysis uses the Car Evaluation dataset, which contains 1,728 instances and six features: buying price, maintenance cost, number of doors, seating capacity, luggage size, and safety rating. With an increasing number of car models available in the market, understanding how different attributes affect car classification can help streamline the evaluation process. The goal is to predict car acceptability categories: unacceptable (unacc), acceptable (acc), good (good), and very good (vgood). Being able to predict car acceptability can enhance automated recommendations, improve quality control, and support consumer purchasing decisions.

## Method & Results

In [1]:
library(tidyverse)     # data wrangling (dplyr, tibble)
library(randomForest)  # random forest and bagging models
library(caret)         # confusionMatrix()

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
randomForest 4.7-1.1

Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'

The following object is masked from 'package:dplyr':

    combine

In [3]:
# Load the raw comma-separated file; it has no header row.
data <- read.table("data/car.data", header = FALSE, sep = ",")
nrow(data)  # number of instances
# Give the columns descriptive names.
data <- data |> rename(buying = V1,
                       maint = V2,
                       doors = V3,
                       persons = V4,
                       lug_boot = V5,
                       safety = V6,
                       class = V7)
# All six attributes and the target are categorical, so store every column as a factor.
data <- data |> mutate(across(everything(), as.factor))
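
As a sanity check on the row count, the 1,728 instances are consistent with one row for every combination of the six attributes. The sketch below rebuilds that grid in base R. The level names are taken from the UCI dataset documentation rather than from this notebook, so treat them as an assumption and compare them against `levels()` on the loaded columns.

```r
# Attribute levels as listed in the UCI documentation
# (assumed here, not read from data/car.data).
levels_by_attr <- list(
  buying   = c("vhigh", "high", "med", "low"),
  maint    = c("vhigh", "high", "med", "low"),
  doors    = c("2", "3", "4", "5more"),
  persons  = c("2", "4", "more"),
  lug_boot = c("small", "med", "big"),
  safety   = c("low", "med", "high")
)

# One row per combination of attribute levels.
grid <- expand.grid(levels_by_attr, stringsAsFactors = TRUE)

# Number of combinations: 4 * 4 * 4 * 3 * 3 * 3.
n_combinations <- nrow(grid)
n_combinations
```

If `nrow(data)` matches `n_combinations`, the file was most likely read completely and without a stray header row.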

In [4]:
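# Hold out 25% of the rows for testing. No seed is set in this cell, so call
# set.seed() beforehand if the split needs to be reproducible.
n <- nrow(data)
trainidx <- sample.int(n, floor(n * .75))
testidx <- setdiff(1:n, trainidx)
train <- data[trainidx, ]
test <- data[testidx, ]
# Random forest with the default mtry: floor(sqrt(6)) = 2 predictors tried per split.
rf <- randomForest(class ~ ., data = train)
# Bagging: mtry equals the number of predictors, so every split considers all six.
bag <- randomForest(class ~ ., data = train, mtry = ncol(data) - 1)
# Collect test-set predictions from both models next to the true labels.
preds <- tibble(truth = test$class, rf = predict(rf, test), bag = predict(bag, test))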
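
The `preds` tibble puts both models' predictions next to the true labels, which makes a side-by-side accuracy comparison straightforward. The sketch below is a minimal illustration using a small made-up stand-in, `preds_toy` (hypothetical values, not results from this analysis), so that it runs without the dataset; with the real object, substitute `preds` for `preds_toy`.

```r
# Toy stand-in for the `preds` tibble (values are invented for illustration).
preds_toy <- data.frame(
  truth = c("unacc", "unacc", "acc", "good", "vgood", "acc"),
  rf    = c("unacc", "unacc", "acc", "acc",  "vgood", "acc"),
  bag   = c("unacc", "acc",   "acc", "good", "acc",   "acc"),
  stringsAsFactors = TRUE
)

# Share of rows each model labels correctly. Comparing as character avoids
# errors when the factor columns do not share the same set of levels.
model_accuracy <- sapply(
  preds_toy[c("rf", "bag")],
  function(p) mean(as.character(p) == as.character(preds_toy$truth))
)
model_accuracy
```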

In [5]:
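# Evaluate the random forest on the held-out test set.
predictions <- predict(rf, test)

# Cross-tabulate predicted against true classes and report summary statistics.
conf_matrix <- confusionMatrix(predictions, test$class)
conf_matrix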

Confusion Matrix and Statistics

          Reference
Prediction acc good unacc vgood
     acc    84   10     4     6
     good    0   15     0     0
     unacc   0    0   303     0
     vgood   0    3     0     7

Overall Statistics
                                         
               Accuracy : 0.9468         
                 95% CI : (0.9212, 0.966)
    No Information Rate : 0.7106         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.8822         
                                         
 Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: acc Class: good Class: unacc Class: vgood
Sensitivity              1.0000     0.53571       0.9870      0.53846
Specificity              0.9425     1.00000       1.0000      0.99284
Pos Pred Value           0.8077     1.00000       1.0000      0.70000
Neg Pred Value           1.0000     0.96882       0.9690      0.98578
Prevalence      
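
To make the link between the confusion matrix and the reported statistics explicit, the following sketch recomputes accuracy, Kappa, the no-information rate, and per-class sensitivity from the counts printed above, using base R only. The variable names are illustrative and are not part of the original notebook.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference).
classes <- c("acc", "good", "unacc", "vgood")
cm <- matrix(
  c(84, 10,   4, 6,
     0, 15,   0, 0,
     0,  0, 303, 0,
     0,  3,   0, 7),
  nrow = 4, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

n <- sum(cm)

# Overall accuracy: correctly classified cases over all test cases.
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's Kappa: observed agreement corrected for chance agreement.
kappa_stat <- (accuracy - expected) / (1 - expected)

# No-information rate: share of the most common reference class.
nir <- max(colSums(cm)) / n

# Sensitivity per class: correct predictions over reference totals.
sensitivity <- diag(cm) / colSums(cm)

round(c(accuracy = accuracy, kappa = kappa_stat, nir = nir), 4)
round(sensitivity, 4)
```

The per-class sensitivities show where the aggregate figures hide weakness: the `good` and `vgood` columns have few reference cases, and roughly half of them are predicted as another class.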

## Discussion
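
One possible way to act on the class-imbalance recommendation from the Summary is to pass class weights to `randomForest()` through its `classwt` argument. The sketch below derives inverse-frequency weights from the reference totals in the test-set confusion matrix, purely for illustration; in practice the counts would come from `table(train$class)`. The weighting scheme is an assumption and was not run as part of this analysis, and the refit call is left commented out because it needs the `train` data frame.

```r
# Reference-class totals from the test-set confusion matrix, used here as an
# illustrative stand-in for table(train$class). The order matches the factor
# levels of `class`, which matters because classwt is matched by position.
class_counts <- c(acc = 84, good = 28, unacc = 307, vgood = 13)

# Inverse-frequency weights: rarer classes receive larger weights.
class_weights <- sum(class_counts) / (length(class_counts) * class_counts)
round(class_weights, 2)

# Hypothetical refit with the weights (requires `train` from the notebook):
# rf_weighted <- randomForest(class ~ ., data = train, classwt = class_weights)
```

Whether weighting actually improves sensitivity for `good` and `vgood` would need to be checked by comparing the confusion matrices of the weighted and unweighted models on the same test set.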

## References

Financial Consumer Agency of Canada. (2024, January 5). Government of Canada. Canada.ca. https://www.canada.ca/en/financial-consumer-agency/services/loans/financing-car/risks.html

Chiu, L., Du, J., & Wang, N. (2022). The Effects of Price Dispersion on Sales in the Automobile Industry: A Dynamic Panel Analysis. SAGE Open. https://doi.org/10.1177/21582440221120647

Vrkljan, B. H., & Anaby, D. (2011). What vehicle features are considered important when buying an automobile? An examination of driver preferences by age and gender. Journal of Safety Research, 42(1), 61-65. https://doi.org/10.1016/j.jsr.2010.11.006

Bohanec, M. (1988). Car Evaluation [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5JP48
