# Predicting car price
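In this notebook, we build machine learning models to predict the buying price of a car. The dataset is the UCI Car Evaluation dataset from https://archive.ics.uci.edu/ml/datasets/Car+Evaluation. In this dataset the buying price is a category (`low`, `med`, `high`, or `vhigh`), so this is a classification task.

Once the models are trained, we will predict the buying price using the best model, given the following parameters: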

Maintenance = High
Number of doors = 4
Lug Boot Size = Big
Safety = High
Class Value = Good


# Install dependencies
Install the required Python packages if they are not installed yet.

In [1]:
!pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://pip:****@gitlab.com/api/v4/projects/37926211/packages/pypi/simple


# Import Packages
These libraries will be used for data handling, model building, and saving the model.

In [2]:
# Importing necessary libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import h2o


# Load and preprocess the data
Load the dataset from the given URL using pandas. Then, we will preprocess the data by converting categorical features into numerical values. Previews of the data before and after preprocessing are printed below.

In [3]:
# Loading the dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data'
columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
df = pd.read_csv(url, names=columns, header=None)
df = df.drop('persons', axis=1)  # persons is dropped as it is not provided in the parameters to use for inference
df.head()

Unnamed: 0,buying,maint,doors,lug_boot,safety,class
0,vhigh,vhigh,2,small,low,unacc
1,vhigh,vhigh,2,small,med,unacc
2,vhigh,vhigh,2,small,high,unacc
3,vhigh,vhigh,2,med,low,unacc
4,vhigh,vhigh,2,med,med,unacc


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   buying    1728 non-null   object
 1   maint     1728 non-null   object
 2   doors     1728 non-null   object
 3   lug_boot  1728 non-null   object
 4   safety    1728 non-null   object
 5   class     1728 non-null   object
dtypes: object(6)
memory usage: 81.1+ KB


In [5]:
# Get the value counts for all columns in the dataframe
for col in df.columns:
    print(f"\nValue counts for {col}:")
    print(df[col].value_counts())


Value counts for buying:
vhigh    432
high     432
med      432
low      432
Name: buying, dtype: int64

Value counts for maint:
vhigh    432
high     432
med      432
low      432
Name: maint, dtype: int64

Value counts for doors:
2        432
3        432
4        432
5more    432
Name: doors, dtype: int64

Value counts for lug_boot:
small    576
med      576
big      576
Name: lug_boot, dtype: int64

Value counts for safety:
low     576
med     576
high    576
Name: safety, dtype: int64

Value counts for class:
unacc    1210
acc       384
good       69
vgood      65
Name: class, dtype: int64
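To put numbers on how balanced a column is, the counts printed above can be converted to shares. The sketch below hard-codes the `class` counts from the output instead of recomputing them from `df`; the variable names are illustrative.

```python
# Counts copied from the value_counts() output for the `class` column.
class_counts = {"unacc": 1210, "acc": 384, "good": 69, "vgood": 65}

total = sum(class_counts.values())  # 1728 rows in total

# Share of each label, as a fraction of all rows.
class_shares = {label: count / total for label, count in class_counts.items()}

for label, share in class_shares.items():
    print(f"{label:>6}: {share:.1%}")
```

The same calculation on any of the other columns gives equal shares for every label, whereas `class` is dominated by `unacc`.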


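There are no missing values, and every feature column, as well as the target `buying`, is perfectly balanced. The `class` column is the exception: `unacc` accounts for about 70% of the rows. All the fields are categorical variables, so they should be encoded before we fit the models.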

In [6]:
# Convert categorical variables to numerical variables
# Create separate LabelEncoder instances for each column
le_buying = LabelEncoder()
le_maint = LabelEncoder()
le_doors = LabelEncoder()
le_lug_boot = LabelEncoder()
le_safety = LabelEncoder()
le_class = LabelEncoder()
df['buying'] = le_buying.fit_transform(df['buying'])
df['maint'] = le_maint.fit_transform(df['maint'])
df['doors'] = le_doors.fit_transform(df['doors'])
df['lug_boot'] = le_lug_boot.fit_transform(df['lug_boot'])
df['safety'] = le_safety.fit_transform(df['safety'])
df['class'] = le_class.fit_transform(df['class'])

print(df.head())

   buying  maint  doors  lug_boot  safety  class
0       3      3      0         2       1      2
1       3      3      0         2       2      2
2       3      3      0         2       0      2
3       3      3      0         1       1      2
4       3      3      0         1       2      2
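`LabelEncoder` assigns integer codes in sorted label order, so the codes above are alphabetical (`high` = 0, `low` = 1, `med` = 2, `vhigh` = 3) and do not follow the price order. If the natural order matters to a model, an explicit mapping is one alternative. The following is a minimal sketch in plain Python; `ORDINAL_LEVELS` and `encode_ordinal` are hypothetical names, not part of the notebook.

```python
# Alphabetical order, which is what sorted-label encoding produces.
labels = ["vhigh", "high", "med", "low"]
alphabetical_codes = {label: code for code, label in enumerate(sorted(labels))}

# Hypothetical explicit mapping that keeps the low < med < high < vhigh order.
ORDINAL_LEVELS = ["low", "med", "high", "vhigh"]
ordinal_codes = {label: code for code, label in enumerate(ORDINAL_LEVELS)}


def encode_ordinal(values):
    """Map a sequence of price labels to their ordinal codes."""
    return [ordinal_codes[value] for value in values]


print(alphabetical_codes)
print(ordinal_codes)
print(encode_ordinal(["vhigh", "low", "med"]))
```

With pandas, a mapping like this could be applied to a raw (not yet encoded) column with `Series.map(ordinal_codes)`.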


# Train-test Split
In this step, we will split the data into train and test sets. We will use 80% of the data for training the model and 20% for testing the model.

In [7]:
# Split the data into training and testing sets
X = df.drop(['buying'], axis=1)
y = df['buying']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model Training
We will train a few models on the training data and compare their performance.

In [8]:
# The tree, forest, and logistic regression classifiers and accuracy_score
# were already imported in the "Import Packages" cell; only SVC is new here.
from sklearn.svm import SVC

# Train and evaluate the decision tree classifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)
acc_dtc = accuracy_score(y_test, y_pred)
print(f"Decision Tree Classifier Accuracy: {acc_dtc}")

# Train and evaluate the random forest classifier
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
acc_rfc = accuracy_score(y_test, y_pred)
print(f"Random Forest Classifier Accuracy: {acc_rfc}")

# Train and evaluate the logistic regression classifier
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
acc_lr = accuracy_score(y_test, y_pred)
print(f"Logistic Regression Classifier Accuracy: {acc_lr}")

# Train and evaluate the support vector machine classifier
svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
acc_svc = accuracy_score(y_test, y_pred)
print(f"SVM Classifier Accuracy: {acc_svc}")


Decision Tree Classifier Accuracy: 0.10115606936416185
Random Forest Classifier Accuracy: 0.10404624277456648
Logistic Regression Classifier Accuracy: 0.20520231213872833
SVM Classifier Accuracy: 0.24566473988439305
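For context, the four `buying` labels are equally frequent, so guessing uniformly at random would be right about 25% of the time. The sketch below converts the reported accuracies back into counts of correct predictions. It assumes the test set has 346 rows (20% of 1,728, rounded up), which is consistent with the printed accuracies; the variable names are illustrative.

```python
import math

n_rows = 1728
n_test = math.ceil(0.2 * n_rows)  # assumed test-set size: 346 rows
chance_accuracy = 1 / 4           # four equally frequent buying labels

# Accuracies copied from the output above.
accuracies = {
    "Decision Tree": 0.10115606936416185,
    "Random Forest": 0.10404624277456648,
    "Logistic Regression": 0.20520231213872833,
    "SVM": 0.24566473988439305,
}

# Number of correctly classified test rows implied by each accuracy.
correct_counts = {name: round(acc * n_test) for name, acc in accuracies.items()}

for name, acc in accuracies.items():
    verdict = "above" if acc > chance_accuracy else "below"
    print(f"{name}: {correct_counts[name]}/{n_test} correct, {verdict} chance")
```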


# Model Inference
We will make a prediction using the trained models.

In [9]:
# Use all models to make an inference
new_record = {
    "maint": "high",
    "doors": "4",
    "lug_boot": "big",
    "safety": "high",
    "class": "good"
}

# Encode the new record using the corresponding LabelEncoder instance
new_record_encoded = {
    "maint": le_maint.transform([new_record["maint"]])[0],
    "doors": le_doors.transform([new_record["doors"]])[0],
    "lug_boot": le_lug_boot.transform([new_record["lug_boot"]])[0],
    "safety": le_safety.transform([new_record["safety"]])[0],
    "class": le_class.transform([new_record["class"]])[0]
}


# Create a new dataframe using the parameter values
X_new = pd.DataFrame(new_record_encoded, index=[0])

# Use all models to make an inference
y_new_dtc = dtc.predict(X_new)
y_new_rfc = rfc.predict(X_new)
y_new_lr = lr.predict(X_new)
y_new_svc = svc.predict(X_new)

# Print out the predicted buying field using each of the trained models
print(f"Predicted buying field using Decision Tree Classifier: {le_buying.inverse_transform([y_new_dtc[0]])[0]}")
print(f"Predicted buying field using Random Forest Classifier: {le_buying.inverse_transform([y_new_rfc[0]])[0]}")
print(f"Predicted buying field using Logistic Regression Classifier: {le_buying.inverse_transform([y_new_lr[0]])[0]}")
print(f"Predicted buying field using SVM Classifier: {le_buying.inverse_transform([y_new_svc[0]])[0]}")


Predicted buying field using Decision Tree Classifier: low
Predicted buying field using Random Forest Classifier: med
Predicted buying field using Logistic Regression Classifier: med
Predicted buying field using SVM Classifier: med


Logistic Regression and the SVM classifier produce the same prediction and outperform the other two models, so I would conclude that the prediction is `med`. However, this prediction should be treated with caution, as even the best model's accuracy (about 24.6%) is below that of a random guess (25%).
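One simple way to combine the four predictions is a plain majority vote, which ignores the accuracy differences between the models. A minimal sketch, with the predictions hard-coded from the output above (the helper name `majority_vote` is illustrative):

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label and how many models chose it."""
    label, votes = Counter(predictions.values()).most_common(1)[0]
    return label, votes


# Predictions copied from the inference output above.
predictions = {
    "Decision Tree": "low",
    "Random Forest": "med",
    "Logistic Regression": "med",
    "SVM": "med",
}

winner, votes = majority_vote(predictions)
print(f"{winner} ({votes} of {len(predictions)} models)")
```

In the case of a tie, `Counter.most_common` returns the label that was encountered first, so a real ensemble would need an explicit tie-breaking rule.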

That said, each of these models was trained with default hyperparameters, so they should be treated as baseline models. Their performance might improve with hyperparameter tuning. In the next section, we will use an AutoML package to train a tuned model and compare its performance and prediction with these baselines.

## AutoML using h2o
First, we need to initialize the h2o cluster.

In [10]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "20.0.1" 2023-04-18; OpenJDK Runtime Environment Homebrew (build 20.0.1); OpenJDK 64-Bit Server VM Homebrew (build 20.0.1, mixed mode, sharing)
  Starting server from /usr/local/Caskroom/miniconda/base/envs/ta23/lib/python3.9/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/2q/6jnqt8_d2h5f8_1mjfwc06480000gr/T/tmpypfz4axk
  JVM stdout: /var/folders/2q/6jnqt8_d2h5f8_1mjfwc06480000gr/T/tmpypfz4axk/h2o_guohao_y_started_from_python.out
  JVM stderr: /var/folders/2q/6jnqt8_d2h5f8_1mjfwc06480000gr/T/tmpypfz4axk/h2o_guohao_y_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


0,1
H2O_cluster_uptime:,02 secs
H2O_cluster_timezone:,Asia/Singapore
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.40.0.4
H2O_cluster_version_age:,15 days
H2O_cluster_name:,H2O_from_python_guohao_y_8qzqpk
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,8 Gb
H2O_cluster_total_cores:,12
H2O_cluster_allowed_cores:,12


Once the cluster is initialized, we can load the car.data dataset into an h2o frame. Note that we do not label-encode the columns here. Keeping `buying` as a string column lets h2o treat it as categorical, so AutoML trains a classifier instead of a regression model.

In [11]:
data = pd.read_csv(url, names=columns, header=None)
data = data[['buying', 'maint', 'doors', 'lug_boot', 'safety', 'class']]
# Convert the pandas DataFrame to an h2o Frame
hf = h2o.H2OFrame(data)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [12]:
# Split the dataset into training and testing sets
train, test = hf.split_frame(ratios=[0.8])


In [13]:
# Specify the feature and target columns
X = hf.columns[1:]
y = 'buying'


In [14]:
# Set up AutoML
from h2o.automl import H2OAutoML

automl = H2OAutoML(max_models=10, seed=42)

# Train the models
automl.train(x=X, y=y, training_frame=train)

AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%


key,value
Stacking strategy,cross_validation
Number of base models (used / total),10/10
# GBM base models (used / total),4/4
# XGBoost base models (used / total),3/3
# GLM base models (used / total),1/1
# DRF base models (used / total),2/2
Metalearner algorithm,GLM
Metalearner fold assignment scheme,Random
Metalearner nfolds,5
Metalearner fold_column,

high,low,med,vhigh,Error,Rate
61.0,23.0,80.0,175.0,0.820059,278 / 339
38.0,115.0,110.0,90.0,0.674221,238 / 353
55.0,107.0,86.0,99.0,0.7521614,261 / 347
85.0,14.0,77.0,159.0,0.5253731,176 / 335
239.0,259.0,353.0,523.0,0.6935953,"953 / 1,374"

k,hit_ratio
1,0.3064047
2,0.6026201
3,0.8020378
4,1.0

high,low,med,vhigh,Error,Rate
140.0,46.0,63.0,90.0,0.5870206,199 / 339
63.0,134.0,71.0,85.0,0.6203966,219 / 353
68.0,70.0,135.0,74.0,0.610951,212 / 347
75.0,37.0,51.0,172.0,0.4865672,163 / 335
346.0,287.0,320.0,421.0,0.577147,"793 / 1,374"

k,hit_ratio
1,0.422853
2,0.6928675
3,0.8617176
4,1.0

Unnamed: 0,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
accuracy,0.424212,0.0250924,0.4379562,0.4501845,0.4366197,0.3897059,0.4065934
auc,,0.0,,,,,
err,0.5757881,0.0250924,0.5620438,0.5498155,0.5633803,0.6102941,0.5934066
err_count,158.2,6.7230945,154.0,149.0,160.0,166.0,162.0
logloss,1.222757,0.030853,1.2149937,1.1866702,1.2059926,1.2402774,1.2658511
max_per_class_error,0.6596483,0.0220906,0.6760563,0.6461539,0.6282051,0.6666667,0.6811594
mean_per_class_accuracy,0.4269377,0.0252628,0.4441698,0.4477034,0.4434066,0.393782,0.4056267
mean_per_class_error,0.5730623,0.0252628,0.5558302,0.5522966,0.5565934,0.6062179,0.5943732
mse,0.473103,0.0117878,0.467533,0.4595512,0.4673882,0.4856672,0.4853754
null_deviance,762.862,14.158211,761.2242,752.9439,787.5973,755.251,757.2937


In [15]:
# Get the best model from AutoML
best_model = automl.leader

# Predict on test set
test_preds = best_model.predict(test)

# Extract actual values for target variable in test set
y_test = test[y]
y_test_list = y_test.as_data_frame().iloc[:,0].tolist()

# Convert predicted values to a Pandas DataFrame and extract the predictions for target variable
y_test_hat = test_preds.as_data_frame()['predict']

accuracy = accuracy_score(y_test_list, y_test_hat)
print(f"AutoML Classifier Accuracy: {accuracy}")


stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%
AutoML Classifier Accuracy: 0.4180790960451977


The AutoML model reaches about 42% accuracy, roughly 1.7 times the best baseline (about 25%), although the two were evaluated on different random test splits.
Next, we will run inference on the same record and see whether the prediction is still the same.

In [16]:
# Use the best model to make a prediction
new_df = pd.DataFrame(new_record, index=[0])
prediction = best_model.predict(h2o.H2OFrame(new_df)).as_data_frame()['predict'][0]
prediction

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%


'low'

The AutoML model predicts the buying price to be `low`. This matches the Decision Tree prediction but differs from the `med` predicted by the other three baseline models.