# The Best Value Cars of 2025
Project 2: Classification, Grayson Nickel, 9/23/2025



# The Problem
For this project, I am asking a question very similar to the one in project 1, using the same data set: what are the best value cars of 2025? Last time, I asked essentially the same thing but did a poor job of calculating the best value cars. All I did was divide price by horsepower and declare the car with the lowest cost per horsepower the best deal. That approach is badly flawed, so this time I am using a decision tree. The tree caps vehicle price so we don't see very expensive vehicles like the Cybertruck again, and it requires a minimum horsepower so the list isn't full of foreign, unsafe, non-street-legal tuk-tuks and the like. Essentially, this project is a remaster of my project 1 question, using a decision tree to classify each car on the list as either good value or not.
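
To make the flaw in the old approach concrete, here is a minimal sketch of ranking by dollars per horsepower. The three cars and their numbers are made up for illustration and are not rows from the data set, and `dollars_per_hp` is a hypothetical helper name.

```python
def dollars_per_hp(price_usd, horsepower):
    """Cost of each unit of horsepower; lower looked 'better' in project 1."""
    return price_usd / horsepower

# Made-up example cars: (price in USD, horsepower). Not taken from the data set.
cars = {
    "tiny three-wheeler": (3_000, 30),
    "luxury electric truck": (100_000, 845),
    "budget sports coupe": (27_000, 205),
}

# Sort from lowest to highest dollars per horsepower.
ranked = sorted(cars, key=lambda name: dollars_per_hp(*cars[name]))
for name in ranked:
    price, hp = cars[name]
    print(f"{name}: ${dollars_per_hp(price, hp):.2f} per hp")
```

With these numbers, the three-wheeler and the six-figure truck both rank ahead of the affordable coupe, which is the kind of result a price cap and a horsepower floor are meant to rule out.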


# The Data
As in project 1, this data set covers 2025 and includes over 1,200 cars. It is called “Cars Datasets (2025)” and was posted by Abdul Malik on Kaggle. The data set is free to use as long as the use is not malicious or for profit. It has 11 columns in total, covering the following data:

* Car Company Names: The manufacturer or brand of the car.
* Car Models: The specific name or series of the car.
* Engine Types: Information on engine specifications.
* CC/Battery Capacity: Engine displacement in cubic centimeters or battery capacity for electric cars.
* Horsepower (HP): The power output of the car's engine or motor.
* Top Speed: The maximum speed the car can achieve.
* 0-100 km/h Performance: The time it takes for the car to accelerate from 0 to 100 km/h.
* Price (in USD): The car's price listed in United States dollars.
* Fuel Type: Specifies whether the car uses petrol, diesel, electricity, or hybrid fuel systems.
* Seating Capacity: The number of passengers the car can accommodate.
* Torque: The rotational force the engine generates.


# Pre-processing
My first step was to set up my environment with all the necessary imports and the path to the data set. Next, I cleaned the price and horsepower columns, because they are stored as strings with dollar signs, commas, and text that would get in the way of calculations. I then removed any rows with missing data and set up my decision tree to make sure everything worked, which it did. To decide which column belongs at the root node, I calculated the Gini impurity of a split on 'Price_Numeric' and of a split on 'HorsePower_Numeric'. The 'Price_Numeric' split has the lower Gini impurity (0.1200 versus 0.1292), so it is the best fit for the root node. After that, I ran a quick print statement to confirm things were working. Finally, I created a visualization of ALL of the good value candidates using Plotly, which lets you hover over each dot and see more information about that specific car without overcrowding the scatterplot.
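
The string cleaning can be pictured one value at a time. The sketch below is a plain-Python approximation of the idea, not the notebook's pandas code. `first_number` is a hypothetical helper, and the sample strings are assumed formats rather than values copied from the data set. Note that the notebook only splits on `/` for horsepower, while this helper does it for every value.

```python
import math
import re

def first_number(text, strip_chars):
    """Drop the given characters, keep the part before any '-' or '/',
    and return NaN when what is left is not a number."""
    cleaned = re.sub(f"[{re.escape(strip_chars)}]", "", str(text))
    head = cleaned.split("-")[0].split("/")[0].strip()
    try:
        return float(head)
    except ValueError:
        return math.nan

# Assumed example strings, chosen to show each case.
print(first_number("$12,000-$15,000", "$,"))  # a price range keeps the lower bound
print(first_number("70-85 hp", "hp,"))        # a horsepower range keeps the lower bound
print(first_number("N/A", "$,"))              # unparseable text becomes NaN
```

Values that come back as NaN are the ones the later missing-data step would drop.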



# Data Understanding/Visualization
While exploring my improved list of good value cars, I immediately noticed a much cleaner set of results: with my power and price limits in place, the cars are far more realistically attainable. I also have a new appreciation for Plotly and how powerful it is for plotting a large data set while still showing details about each point. My Plotly visualization is a scatterplot of every car classified as good value. Hovering shows the make, model, price, and horsepower of each car, and the plot doesn't look the least bit crowded. Make sure to scroll to the bottom to see the scatterplot, and try hovering over one of the cars while you're there.


# Modeling
The classification model I used for this project is a decision tree whose root checks whether the price is at most $35,000. If that check is false, the car lands on a leaf labeled not good value. If it is true, the tree branches to a second check of whether the car makes at least 200 horsepower. That check leads to two leaves: true labels the car good value and false labels it not good value. I chose 'Price_Numeric' as the root node by calculating the weighted Gini impurity of the 'Price_Numeric' split and the 'HorsePower_Numeric' split separately and then comparing them by hand. The price split had the lower impurity (0.1200 versus 0.1292), so I check it first for efficiency's sake.
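
Written as nested checks, the tree described above looks roughly like the sketch below. The function name and parameter names are hypothetical. The price comparison uses `<=` (so exactly $35,000 still passes) to match the comparison in the notebook code.

```python
def classify_good_value(price_usd, horsepower, max_price=35_000, min_hp=200):
    """Two-level decision tree: price check at the root, horsepower check below it."""
    if price_usd > max_price:
        # Root check failed: leaf "not good value".
        return False
    if horsepower >= min_hp:
        # Both checks passed: leaf "good value".
        return True
    # Affordable but under-powered: leaf "not good value".
    return False

# The first call uses the Toyota 86 row from the notebook output;
# the other two are made-up cars for contrast.
print(classify_good_value(27_000, 205))  # True
print(classify_good_value(60_000, 500))  # False
print(classify_good_value(20_000, 150))  # False
```

Because the function returns as soon as the price check fails, an over-budget car never reaches the horsepower check.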

# Evaluation
Overall, I would say my model performs very well in terms of speed. I calculated the Gini impurity once, so I never need to run those lines again, and the decision tree checks the lower-impurity column first and makes at most two decisions per car to determine whether it is good value. Paired with a data set of only around 1,200 cars, that makes execution time negligible.


# Storytelling
I learned which 2025 cars are a good value given my personal preferences of a maximum price of $35,000 and a minimum of 200 horsepower. From that list, I particularly like the Toyota 86, GT86, Nissan 370Z, and Ford Fiesta ST. That result checks out, because those are all great value cars that have been on my radar for years for their sporty performance and price. With that, I was able to answer my question of what the good value cars of 2025 are. I was even introduced to some cars I had never considered, like the Hyundai Veloster.


# Impact Section
With a more accurate list of the good value cars of 2025, I think this project can have a positive impact on people in the market for a new car. Of course, the decision tree was built around my own preferences for maximum price and minimum horsepower, but the list should still provide a good baseline for everyone else. I can't think of many genuine negative impacts, because the more foreign, dangerous, or outrageously expensive cars are no longer on the list, which makes the remaining options more viable for people looking for a brand new car. One limitation to keep in mind, however, is that the list cannot predict reliability, since brand new cars have yet to prove themselves over time and use. That matters because many new cars use more plastic and more complex designs, which could reduce durability and reliability. And if you're already on a budget for a new car, you may not be able to afford expensive repairs if the car turns out to be unreliable.


# References
* The dataset: https://www.kaggle.com/datasets/abdulmalik1518/cars-datasets-2025
* Seaborn API: https://seaborn.pydata.org/api.html
* Plotly Library: https://plotly.com/python/
* Video tutorial: https://www.youtube.com/watch?v=LDRbO9a6XPU


# The Code

In [None]:
# All imports and directory for csv.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from google.colab import drive
drive.mount('/content/drive')
cars_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Cars Datasets 2025.csv", encoding="cp1252")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Clean and remove currency symbols, commas, and handle ranges (take the lower value)
cars_df['Price_Numeric'] = cars_df['Cars Prices'].astype(str).str.replace('[$,]', '', regex=True).str.split('-').str[0]
cars_df['Price_Numeric'] = pd.to_numeric(cars_df['Price_Numeric'], errors='coerce')

# Clean and convert 'HorsePower' to numeric
cars_df['HorsePower_Numeric'] = cars_df['HorsePower'].astype(str).str.replace('[hp,]', '', regex=True).str.split('-').str[0].str.split('/').str[0]
cars_df['HorsePower_Numeric'] = pd.to_numeric(cars_df['HorsePower_Numeric'], errors='coerce')

In [None]:
# Create the 'Good Value' decision tree, storing a boolean as the final classification
# (the order of the checks was determined by the Gini impurities of Price_Numeric and HorsePower_Numeric below)
price_condition = cars_df['Price_Numeric'] <= 35000
hp_condition = cars_df['HorsePower_Numeric'] >= 200
cars_df['Good_Value'] = price_condition & hp_condition

In [None]:
# Drop rows with missing values in the relevant columns
cars_df.dropna(subset=['Price_Numeric', 'HorsePower_Numeric', 'Good_Value'], inplace=True)

In [None]:
# Gini of a label vector
gini = lambda y: 1 - (y.value_counts(normalize=True)**2).sum()

# Gini of Price_Numeric <= 35000
m = cars_df['Price_Numeric'] <= 35000
g_price = m.mean()*gini(cars_df.loc[m,'Good_Value']) + (1-m.mean())*gini(cars_df.loc[~m,'Good_Value'])

# Gini of HorsePower_Numeric >= 200
m2 = cars_df['HorsePower_Numeric'] >= 200
g_hp = m2.mean()*gini(cars_df.loc[m2,'Good_Value']) + (1-m2.mean())*gini(cars_df.loc[~m2,'Good_Value'])

print(f"Price_Numeric impurity: {g_price:.4f}")
print(f"HorsePower_Numeric impurity: {g_hp:.4f}")

Price_Numeric impurity: 0.1200
HorsePower_Numeric impurity: 0.1292
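
For reference, the numbers printed above follow the usual two-way split formula: the Gini impurity of each side of the split, weighted by the share of rows on that side. Below is a minimal plain-Python sketch on four made-up labels. The helper names `gini` and `split_impurity` are illustrative, and the toy data is not from the data set.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of labels: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0  # an empty side contributes nothing to the weighted sum
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(mask, labels):
    """Weighted Gini impurity of the two groups produced by a boolean mask."""
    passed = [y for m, y in zip(mask, labels) if m]
    failed = [y for m, y in zip(mask, labels) if not m]
    n = len(labels)
    return len(passed) / n * gini(passed) + len(failed) / n * gini(failed)

# Toy data: four cars, two of which pass the check; one of those is good value.
mask = [True, True, False, False]
labels = [True, False, False, False]
print(split_impurity(mask, labels))  # 0.25
```

Here the passing side is a 50/50 mix (impurity 0.5) and the failing side is pure (impurity 0), so the weighted result is 0.5 × 0.5 + 0.5 × 0 = 0.25.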


In [None]:
# Display the first 10 good value cars of 2025 (in data set order, not ranked)
good_value_cars = cars_df[cars_df['Good_Value']]
display(good_value_cars[['Company Names', 'Cars Names', 'Price_Numeric', 'HorsePower_Numeric', 'Good_Value']].head(10))

| Index | Company Names | Cars Names | Price_Numeric | HorsePower_Numeric | Good_Value |
| --- | --- | --- | --- | --- | --- |
| 18 | TOYOTA | TOYOTA 86 | 27000.0 | 205.0 | True |
| 19 | TOYOTA | TOYOTA GR86 | 30000.0 | 228.0 | True |
| 23 | NISSAN | 370Z | 30000.0 | 332.0 | True |
| 26 | NISSAN | MAXIMA | 35000.0 | 300.0 | True |
| 28 | NISSAN | ROGUE | 28000.0 | 201.0 | True |
| 29 | NISSAN | PATHFINDER | 35000.0 | 284.0 | True |
| 30 | NISSAN | FRONTIER | 30000.0 | 310.0 | True |
| 138 | AUDI | Q3 | 35000.0 | 248.0 | True |
| 151 | BMW | M135i XDRIVE | 30000.0 | 302.0 | True |
| 199 | TOYOTA | CAMRY | 27000.0 | 301.0 | True |


In [None]:
# Plotly figure of good value cars (hover over each dot to see information on the specific car)
fig = px.scatter(good_value_cars, x='Price_Numeric', y='HorsePower_Numeric',
                 hover_name='Cars Names',
                 hover_data=['Company Names'],
                 title='The Best Value Cars of 2025')
fig.update_layout(xaxis_title='Price (USD)', yaxis_title='Horsepower')
fig.show()