# **Car Evaluation Analysis**

### DSCI 310 2024W Group 5

#### Authors:
- Nika Karimi Seffat
- Ethan Wang
- Gautam Arora
- Kevin Li

### **Introduction**

**Background**: 

Cars and personal transportation are an integral part of everyday life in the developed world. They play a crucial role in people's daily routines, enabling them to commute to work, travel, attend social gatherings, and explore new places. However, cars also pose significant safety risks. In 2023 alone, car accidents accounted for over 40,000 fatalities in the United States (Wikipedia Contributors, 2019). Given these risks, it is no surprise that many individuals seek ways to make car travel safer while maintaining its convenience.

Research has shown that consumers are willing to invest in vehicle safety. For example, Andersson (2008) found that Swedish drivers were willing to pay a premium for improved car design and build quality to reduce the risk of injury in accidents. Additionally, design and material choices play a critical role in determining a vehicle's safety. Nygren (1983) found that factors such as a car’s weight, seatbelt design, and headrests significantly influenced accident survivability. More recently, Richter et al. (2005) demonstrated that passive safety improvements—such as enhanced structural integrity and interior design modifications—have contributed to a measurable decline in injury rates from car accidents. These findings highlight the importance of identifying key factors that contribute to vehicle safety.

Given this context, our research project aims to answer the following question:

**Can we predict the estimated safety of a car using various attributes, such as its buying price, capacity, and maintenance cost?**

To answer this question, we will use the Car Evaluation Dataset from UC Irvine’s Machine Learning Repository. This multivariate classification dataset contains 1,728 observations of seven categorical variables: six car attributes and an overall evaluation class. Key variables in our analysis include the car’s buying price, maintenance cost, seating capacity (the number of passengers it can accommodate), estimated safety (the variable we aim to predict), and evaluation level (categorized as unacceptable, acceptable, good, or very good).

## **Methods and Results**

### Data Loading

The dataset was retrieved from the UCI Machine Learning Repository and loaded into R using the `read.csv` function. It contains categorical variables describing various attributes of cars, which will be used for classification.

In [2]:
install.packages("tidyverse")
install.packages("class")
install.packages("caret")



In [3]:
# First, we will load the required libraries necessary to perform data wrangling, visualization, and analysis.
library(tidyverse) # Contains dplyr, ggplot2, and other libraries to perform data cleaning and visualization.
library(class) # For the kNN Classifier.
library(caret) # For train-test-split and cross-validation.


In [4]:
# Loading the dataset from the web
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
car_data <- read.csv(url, header = FALSE, stringsAsFactors = TRUE) # since all input variables are categorical, we set the data type to a factor.

# Display first few rows
head(car_data)

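| | V1 | V2 | V3 | V4 | V5 | V6 | V7 |
|---|---|---|---|---|---|---|---|
| | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` |
| 1 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 3 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 5 | vhigh | vhigh | 2 | 2 | med | med | unacc |
| 6 | vhigh | vhigh | 2 | 2 | med | high | unacc |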


The raw file has no header row, so R assigns the default names `V1` through `V7` when we read in the CSV file. We manually assign meaningful column names based on the UCI dataset documentation.

In [5]:
# Assigning the column names
colnames(car_data) <- c("buying", "maint", "doors", "persons", "lug_boot", "safety", "class")
head(car_data)

| | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|
| | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` |
| 1 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 3 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 5 | vhigh | vhigh | 2 | 2 | med | med | unacc |
| 6 | vhigh | vhigh | 2 | 2 | med | high | unacc |


#### Data Wrangling and Cleaning

In [6]:
# Checking for missing values:
sum(is.na(car_data))

Amazing! No missing values to drop or account for.
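
A total of zero confirms that the table is complete, but a per-column count is often easier to read when a dataset does contain gaps. The sketch below uses a small hypothetical data frame (`toy_cars`, not part of the original analysis) with one deliberately missing value to show how `colSums(is.na(...))` locates it.

```r
# Hypothetical toy data with one missing value; this is an illustration only.
toy_cars <- data.frame(
  buying = c("vhigh", "high", NA),
  safety = c("low", "med", "high"),
  stringsAsFactors = TRUE
)

# Total number of missing cells across the whole data frame.
total_missing <- sum(is.na(toy_cars))

# Missing cells broken down by column.
missing_by_column <- colSums(is.na(toy_cars))

print(total_missing)
print(missing_by_column)
```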

In [8]:
# Check for duplicate values:
# Remove duplicate rows (if any)
car_data <- car_data %>% distinct()
nrow(car_data)

We have the same number of rows as before - there were no duplicate rows!
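
Comparing row counts before and after `distinct()` works, but duplicates can also be counted directly before anything is removed. The following is a minimal base R sketch on a hypothetical data frame (`toy_cars` is an illustrative name, not part of the original analysis) in which one row is repeated on purpose.

```r
# Hypothetical toy data in which the first and third rows are identical.
toy_cars <- data.frame(
  buying = c("vhigh", "low", "vhigh"),
  safety = c("low", "high", "low")
)

# duplicated() flags every row that repeats an earlier row.
n_duplicates <- sum(duplicated(toy_cars))

# unique() keeps the first occurrence of each row, much like dplyr::distinct().
toy_cars_unique <- unique(toy_cars)

print(n_duplicates)
print(nrow(toy_cars_unique))
```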

#### Data Types

Since k-Nearest Neighbors (kNN) is a distance-based algorithm, it requires numerical input for feature comparisons. However, our dataset currently consists of categorical variables (all factors in R).
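
To make the role of distance concrete, here is a minimal sketch of the Euclidean distance that kNN implementations such as `class::knn` use, applied to two hypothetical cars whose attributes are already numeric. The vectors `car_a` and `car_b` and the helper `euclidean_distance` are illustrative names, not part of the original analysis. The same subtraction is undefined for labels such as `"vhigh"` and `"low"`, which is why an encoding step is needed.

```r
# Two hypothetical cars on numeric scales (values are illustrative only).
car_a <- c(buying = 4, maint = 4, doors = 2, persons = 2, lug_boot = 1)
car_b <- c(buying = 1, maint = 4, doors = 2, persons = 2, lug_boot = 1)

# Euclidean distance: square root of the summed squared differences.
euclidean_distance <- function(x, y) {
  sqrt(sum((x - y)^2))
}

# The two cars differ only in buying price (4 vs 1).
distance_ab <- euclidean_distance(car_a, car_b)
print(distance_ab)
```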

In [19]:
# Every variable in this dataset is ordinal: its categories have a natural order.
# We can therefore use ordinal encoding to convert these factors to doubles so that kNN can be used on them.
# Scaling is not applied here, since all encoded features fall on similar small integer ranges.

# Define one encoding function shared by all categorical features
encode_levels <- function(x) {
  x <- as.character(x)   # Compare labels as text rather than relying on factor codes
  case_when(
    x == "vhigh"  ~ 4,   # buying and maint
    x == "high"   ~ 3,   # buying and maint
    x == "med"    ~ 2,   # buying, maint, and lug_boot
    x == "low"    ~ 1,   # buying and maint
    x == "big"    ~ 3,   # lug_boot
    x == "small"  ~ 1,   # lug_boot
    x == "more"   ~ 5,   # 'more' in persons column treated as 5
    x == "5more"  ~ 5,   # '5more' in doors column treated as 5
    x == "2"      ~ 2,   # doors and persons
    x == "3"      ~ 3,   # doors
    x == "4"      ~ 4,   # doors and persons
    x == "unacc"  ~ 1,   # class
    x == "acc"    ~ 2,
    x == "good"   ~ 3,
    x == "vgood"  ~ 4,
    TRUE          ~ NA_real_ # Unrecognized labels become NA rather than arbitrary factor codes
  )
}

# Apply encoding to all columns except the target variable (safety)
car_data_encoded <- car_data %>%
  mutate(across(-safety, encode_levels)) %>%
  mutate(safety = as.factor(safety))  # Keep target as a factor for classification

# Display first few rows
head(car_data_encoded)

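| | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` | `<dbl>` |
| 1 | 4 | 4 | 2 | 2 | 1 | low | 1 |
| 2 | 4 | 4 | 2 | 2 | 1 | med | 1 |
| 3 | 4 | 4 | 2 | 2 | 1 | high | 1 |
| 4 | 4 | 4 | 2 | 2 | 2 | low | 1 |
| 5 | 4 | 4 | 2 | 2 | 2 | med | 1 |
| 6 | 4 | 4 | 2 | 2 | 2 | high | 1 |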
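
An alternative to a single shared lookup function is to declare the level order of each column explicitly and convert with `as.integer()`. This is only a sketch, shown for one column. It assumes the ordering small < med < big for `lug_boot`, and the vector names are hypothetical. Integer codes follow position in the declared order (1, 2, 3, ...), so this approach would not reproduce choices such as mapping `more` to 5.

```r
# Hypothetical sample of lug_boot values (labels taken from the dataset's levels).
lug_boot_raw <- c("small", "med", "big", "med")

# Declare the order explicitly; by default R sorts levels alphabetically
# (big, med, small), which would scramble the ranking.
lug_boot_ordered <- factor(
  lug_boot_raw,
  levels = c("small", "med", "big"),
  ordered = TRUE
)

# Integer codes follow the declared order: small = 1, med = 2, big = 3.
lug_boot_encoded <- as.integer(lug_boot_ordered)

print(lug_boot_encoded)
```

Keeping one mapping per column makes each ordering visible at the point of use, at the cost of more code than the shared function.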


### **References**

- Andersson, H. (2008). Willingness to pay for car safety: evidence from Sweden. Environmental and Resource Economics, 41, 579-594.

- Nygren, Å. (1983). Injuries to car occupants—Some aspects of the interior safety of cars: A study of a five-year material from an insurance company. Acta Oto-Laryngologica, 95(sup395), 1-135.

- Richter, M., Pape, H. C., Otte, D., & Krettek, C. (2005). Improvements in passive car safety led to decreased injury severity–a comparison between the 1970s and 1990s. Injury, 36(4), 484-488.

- Wood, D. P. (1997). Safety and the car size effect: A fundamental explanation. Accident Analysis & Prevention, 29(2), 139-151.

- Wikipedia Contributors. (2019, March 21). Motor vehicle fatality rate in U.S. by year. Wikipedia; Wikimedia Foundation. https://en.wikipedia.org/wiki/Motor_vehicle_fatality_rate_in_U.S._by_year