# Predicting car price
In this notebook, we build machine learning models to predict the buying-price category (`buying`) of a car. The dataset is from https://archive.ics.uci.edu/ml/datasets/Car+Evaluation.

Once the models are trained, we will predict the buying price using the best model, given the following parameters:

Maintenance = High
Number of doors = 4
Lug Boot Size = Big
Safety = High
Class Value = Good


# Install dependencies
Install the required Python packages if they are not installed yet.
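
To see whether a package is already present (and at which version) before running the install cell, a small check like the following could be used. This is a minimal sketch using only the standard library; the helper name `installed_version` and the package names in the loop are illustrative assumptions and should be adjusted to match `requirements.txt`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Example package names (assumed); adjust to match requirements.txt
for name in ("scikit-learn", "pandas", "h2o"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Checking versions up front can help spot situations where one package pins a version range that another install would override.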

In [36]:
!pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://pip:****@gitlab.com/api/v4/projects/37926211/packages/pypi/simple
Collecting scikit-learn==1.2.2
  Using cached scikit_learn-1.2.2-cp39-cp39-macosx_10_9_x86_64.whl (9.1 MB)
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 0.24.2
    Uninstalling scikit-learn-0.24.2:
      Successfully uninstalled scikit-learn-0.24.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
auto-sklearn 0.15.0 requires scikit-learn<0.25.0,>=0.24.0, but you have scikit-learn 1.2.2 which is incompatible.
Successfully installed scikit-learn-1.2.2


# Import Packages
These libraries will be used for data handling, model building, and saving the model.

In [91]:
# Importing necessary libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import h2o


# Load and preprocess the data
We load the dataset from the given URL using pandas, then preprocess it by converting the categorical features into numerical values. A preview of the data before and after preprocessing is printed below.

In [120]:
# Loading the dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data'
columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
data = pd.read_csv(url, names=columns, header=None)
data = data.drop('persons', axis=1)  # persons is dropped as it is not provided in the parameters to use for inference
data.head()

,buying,maint,doors,lug_boot,safety,class
0,vhigh,vhigh,2,small,low,unacc
1,vhigh,vhigh,2,small,med,unacc
2,vhigh,vhigh,2,small,high,unacc
3,vhigh,vhigh,2,med,low,unacc
4,vhigh,vhigh,2,med,med,unacc


In [121]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   buying    1728 non-null   object
 1   maint     1728 non-null   object
 2   doors     1728 non-null   object
 3   lug_boot  1728 non-null   object
 4   safety    1728 non-null   object
 5   class     1728 non-null   object
dtypes: object(6)
memory usage: 81.1+ KB


In [107]:
# Work on a copy so the raw, string-valued `data` stays intact for the h2o section later
df = data.copy()
# Convert categorical variables to numerical variables
# Create separate LabelEncoder instances for each column
le_buying = LabelEncoder()
le_maint = LabelEncoder()
le_doors = LabelEncoder()
le_lug_boot = LabelEncoder()
le_safety = LabelEncoder()
le_class = LabelEncoder()
df['buying'] = le_buying.fit_transform(df['buying'])
df['maint'] = le_maint.fit_transform(df['maint'])
df['doors'] = le_doors.fit_transform(df['doors'])
df['lug_boot'] = le_lug_boot.fit_transform(df['lug_boot'])
df['safety'] = le_safety.fit_transform(df['safety'])
df['class'] = le_class.fit_transform(df['class'])

print(df.head())

   buying  maint  doors  lug_boot  safety  class
0       3      3      0         2       1      2
1       3      3      0         2       2      2
2       3      3      0         2       0      2
3       3      3      0         1       1      2
4       3      3      0         1       2      2
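
To interpret the integer codes in the preview above, it helps to look at how `LabelEncoder` assigns them: it sorts the distinct labels and uses each label's position as its code. The sketch below fits a fresh encoder on the four `buying` labels only (the variable names `le` and `mapping` are illustrative and not part of the notebook's state).

```python
from sklearn.preprocessing import LabelEncoder

# Fit on the four buying-price labels found in the dataset
le = LabelEncoder()
le.fit(["vhigh", "high", "med", "low"])

# classes_ is sorted; the position of each label is its integer code
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# Round trip: integer code back to the original label
decoded = le.inverse_transform([3])[0]
print(decoded)
```

Note that the codes follow alphabetical order rather than the natural order of the categories (`high` is 0 while `low` is 1). Tree-based models are largely indifferent to this, but models that treat the codes as magnitudes, such as logistic regression, may be affected; an explicit ordinal mapping or one-hot encoding would be possible alternatives.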


In [88]:
# Get the value counts for all columns in the dataframe
for col in df.columns:
    print(f"\nValue counts for {col}:")
    print(df[col].value_counts())


Value counts for buying:
3    432
0    432
2    432
1    432
Name: buying, dtype: int64

Value counts for maint:
3    432
0    432
2    432
1    432
Name: maint, dtype: int64

Value counts for doors:
0    432
1    432
2    432
3    432
Name: doors, dtype: int64

Value counts for lug_boot:
2    576
1    576
0    576
Name: lug_boot, dtype: int64

Value counts for safety:
1    576
2    576
0    576
Name: safety, dtype: int64

Value counts for class:
2    1210
0     384
1      69
3      65
Name: class, dtype: int64


# Train-test Split
In this step, we will split the data into train and test sets. We will use 80% of the data for training the model and 20% for testing the model.

In [76]:
# Split the data into training and testing sets
X = df.drop(['buying'], axis=1)
y = df['buying']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
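
A variation worth considering is a stratified split, which keeps the class proportions of the target the same in the training and test sets. The sketch below uses synthetic stand-in arrays (`X_demo`, `y_demo`) with the same number of rows and the same class balance as the `buying` column, rather than the notebook's own `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded data: 1728 rows, four balanced target classes
X_demo = np.arange(1728).reshape(-1, 1)
y_demo = np.tile([0, 1, 2, 3], 432)

# stratify=y_demo asks for the same class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Number of test rows per class
test_counts = np.bincount(y_te)
print(len(y_tr), len(y_te), test_counts)
```

With the notebook's data, the equivalent change would be passing `stratify=y` to the existing `train_test_split` call.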

# Model Training
We will train several models on the training data and compare their performance.

In [77]:
# The other estimators and accuracy_score were already imported at the top of the notebook
from sklearn.svm import SVC

# Train and evaluate the decision tree classifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)
acc_dtc = accuracy_score(y_test, y_pred)
print(f"Decision Tree Classifier Accuracy: {acc_dtc}")

# Train and evaluate the random forest classifier
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
acc_rfc = accuracy_score(y_test, y_pred)
print(f"Random Forest Classifier Accuracy: {acc_rfc}")

# Train and evaluate the logistic regression classifier
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
acc_lr = accuracy_score(y_test, y_pred)
print(f"Logistic Regression Classifier Accuracy: {acc_lr}")

# Train and evaluate the support vector machine classifier
svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
acc_svc = accuracy_score(y_test, y_pred)
print(f"SVM Classifier Accuracy: {acc_svc}")


Decision Tree Classifier Accuracy: 0.10115606936416185
Random Forest Classifier Accuracy: 0.10404624277456648
Logistic Regression Classifier Accuracy: 0.20520231213872833
SVM Classifier Accuracy: 0.24566473988439305
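
To put the accuracies above in context, they can be compared with chance level. The sketch below takes the class counts of `buying` from the value counts printed earlier and the accuracies from the output above; note that the counts describe the full dataset, so the class balance of the test split may differ slightly.

```python
# Class counts for the encoded `buying` column, copied from the value counts above
buying_counts = {3: 432, 0: 432, 2: 432, 1: 432}
total = sum(buying_counts.values())

# Accuracy of always predicting the most frequent class
majority_baseline = max(buying_counts.values()) / total

# Expected accuracy of guessing each class uniformly at random
uniform_baseline = 1 / len(buying_counts)

# Test accuracies reported in the output above
accuracies = {
    "Decision Tree": 0.10115606936416185,
    "Random Forest": 0.10404624277456648,
    "Logistic Regression": 0.20520231213872833,
    "SVM": 0.24566473988439305,
}

print(f"Majority-class baseline: {majority_baseline:.3f}")
for name, acc in accuracies.items():
    print(f"{name}: {acc:.3f} ({acc - uniform_baseline:+.3f} vs. chance)")
```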


# Model Inference
We will make a prediction using the trained models.

In [87]:
# Define the new record to run inference on
new_record = {
    "maint": "high",
    "doors": "4",
    "lug_boot": "big",
    "safety": "high",
    "class": "good"
}

# Encode the new record using the corresponding LabelEncoder instance
new_record_encoded = {
    "maint": le_maint.transform([new_record["maint"]])[0],
    "doors": le_doors.transform([new_record["doors"]])[0],
    "lug_boot": le_lug_boot.transform([new_record["lug_boot"]])[0],
    "safety": le_safety.transform([new_record["safety"]])[0],
    "class": le_class.transform([new_record["class"]])[0]
}


# Create a new dataframe using the parameter values
X_new = pd.DataFrame(new_record_encoded, index=[0])

# Use all models to make an inference
y_new_dtc = dtc.predict(X_new)
y_new_rfc = rfc.predict(X_new)
y_new_lr = lr.predict(X_new)
y_new_svc = svc.predict(X_new)

# Print out the predicted buying field using each of the trained models
print(f"Predicted buying field using Decision Tree Classifier: {le_buying.inverse_transform([y_new_dtc[0]])[0]}")
print(f"Predicted buying field using Random Forest Classifier: {le_buying.inverse_transform([y_new_rfc[0]])[0]}")
print(f"Predicted buying field using Logistic Regression Classifier: {le_buying.inverse_transform([y_new_lr[0]])[0]}")
print(f"Predicted buying field using SVM Classifier: {le_buying.inverse_transform([y_new_svc[0]])[0]}")


Predicted buying field using Decision Tree Classifier: low
Predicted buying field using Random Forest Classifier: med
Predicted buying field using Logistic Regression Classifier: med
Predicted buying field using SVM Classifier: med


Since the Logistic Regression and SVM classifiers produce the same prediction and outperform the other two models, I would conclude that the prediction is `med`. However, this prediction should be treated with caution: even the best model's accuracy (about 24.6%) is slightly below the 25% expected from random guessing among four equally frequent classes.
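
One way to make this kind of reasoning explicit is an accuracy-weighted vote, where each model's prediction counts in proportion to its test accuracy. This is only an illustrative sketch with the predictions and accuracies copied from the outputs above; it is not the procedure used in this notebook, and with accuracies this low the result carries little weight.

```python
from collections import defaultdict

# Predictions and test accuracies copied from the outputs above
predictions = {
    "Decision Tree": "low",
    "Random Forest": "med",
    "Logistic Regression": "med",
    "SVM": "med",
}
accuracies = {
    "Decision Tree": 0.10115606936416185,
    "Random Forest": 0.10404624277456648,
    "Logistic Regression": 0.20520231213872833,
    "SVM": 0.24566473988439305,
}

# Each model adds its accuracy to the score of the label it predicted
votes = defaultdict(float)
for model_name, label in predictions.items():
    votes[label] += accuracies[model_name]

winner = max(votes, key=votes.get)
print(dict(votes), winner)
```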

That said, each of these models is trained with the default hyperparameters, so they should be treated as baseline models. Their performance might improve with hyperparameter tuning. In the next section, we will use an AutoML package to train a tuned model and compare its performance and prediction with these baselines.
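
For reference, manual hyperparameter tuning in scikit-learn could look roughly like the sketch below, which runs a cross-validated grid search over a decision tree. The data here is a synthetic stand-in (random integer codes, not the car dataset) and the search space in `param_grid` is a hypothetical choice made only for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (random integer codes), NOT the car dataset
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 4, size=(200, 5))
y_demo = rng.integers(0, 4, size=200)

# Hypothetical search space, chosen only for illustration
param_grid = {"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 5, 10]}

# 5-fold cross-validated search over every combination in param_grid
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

With the notebook's data, `X_train` and `y_train` would take the place of the synthetic arrays.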

## AutoML using h2o
First, we need to initialize the h2o cluster.

In [92]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "20.0.1" 2023-04-18; OpenJDK Runtime Environment Homebrew (build 20.0.1); OpenJDK 64-Bit Server VM Homebrew (build 20.0.1, mixed mode, sharing)
  Starting server from /usr/local/Caskroom/miniconda/base/envs/ta23/lib/python3.9/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/2q/6jnqt8_d2h5f8_1mjfwc06480000gr/T/tmpb0pdhiz2
  JVM stdout: /var/folders/2q/6jnqt8_d2h5f8_1mjfwc06480000gr/T/tmpb0pdhiz2/h2o_guohao_y_started_from_python.out
  JVM stderr: /var/folders/2q/6jnqt8_d2h5f8_1mjfwc06480000gr/T/tmpb0pdhiz2/h2o_guohao_y_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime:,02 secs
H2O_cluster_timezone:,Asia/Singapore
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.40.0.4
H2O_cluster_version_age:,15 days
H2O_cluster_name:,H2O_from_python_guohao_y_vr91si
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,8 Gb
H2O_cluster_total_cores:,12
H2O_cluster_allowed_cores:,12


Once the cluster is initialized, we can load the car.data dataset into an h2o frame.

In [123]:
# Convert the pandas DataFrame to an h2o Frame
hf = h2o.H2OFrame(data)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [124]:
# Split the dataset into training and testing sets
train, test = hf.split_frame(ratios=[0.8])


In [125]:
# Specify the feature and target columns
X = hf.columns[1:]  # every column except the first one ('buying')
y = 'buying'


In [126]:
# Set up AutoML
from h2o.automl import H2OAutoML

automl = H2OAutoML(max_models=10, seed=42)

# Train the models
automl.train(x=X, y=y, training_frame=train)

AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%


key,value
Stacking strategy,cross_validation
Number of base models (used / total),10/10
# GBM base models (used / total),4/4
# XGBoost base models (used / total),3/3
# GLM base models (used / total),1/1
# DRF base models (used / total),2/2
Metalearner algorithm,GLM
Metalearner fold assignment scheme,Random
Metalearner nfolds,5
Metalearner fold_column,

high,low,med,vhigh,Error,Rate
81.0,27.0,73.0,164.0,0.7652174,264 / 345
57.0,33.0,166.0,74.0,0.9,297 / 330
73.0,59.0,121.0,92.0,0.6492754,224 / 345
121.0,22.0,60.0,139.0,0.5935673,203 / 342
332.0,141.0,420.0,469.0,0.7254038,"988 / 1,362"

k,hit_ratio
1,0.2745962
2,0.5947137
3,0.8039648
4,1.0

high,low,med,vhigh,Error,Rate
158.0,41.0,66.0,80.0,0.542029,187 / 345
70.0,108.0,63.0,89.0,0.6727273,222 / 330
74.0,53.0,131.0,87.0,0.6202899,214 / 345
77.0,27.0,49.0,189.0,0.4473684,153 / 342
379.0,229.0,309.0,445.0,0.5697504,"776 / 1,362"

k,hit_ratio
1,0.4302496
2,0.6989721
3,0.8722467
4,1.0

metric,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
accuracy,0.4287155,0.0248796,0.4343066,0.3851852,0.437276,0.4386617,0.4481482
auc,,0.0,,,,,
err,0.5712845,0.0248796,0.5656934,0.6148148,0.562724,0.5613383,0.5518519
err_count,155.6,6.618157,155.0,166.0,157.0,151.0,149.0
logloss,1.2264715,0.0271345,1.2223587,1.2634153,1.2149441,1.240351,1.1912885
max_per_class_error,0.690912,0.0621166,0.6617647,0.7794118,0.7260274,0.6206896,0.6666667
mean_per_class_accuracy,0.4291309,0.0304035,0.4367122,0.3753537,0.4428055,0.4413783,0.4494048
mean_per_class_error,0.5708692,0.0304035,0.5632879,0.6246462,0.5571945,0.5586218,0.5505952
mse,0.473647,0.0131793,0.4743715,0.4944717,0.4675651,0.4731234,0.4587031
null_deviance,756.15155,11.660421,760.50385,750.2248,774.6756,746.4502,748.9034
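
The `Error` and `Rate` columns and the `hit_ratio` tables in the output above can be reproduced from the confusion matrix counts. The sketch below uses the first confusion matrix, reading rows as actual classes and columns as predicted classes; the per-class error is the share of off-diagonal counts in each row, and the `k = 1` hit ratio is one minus the overall error.

```python
labels = ["high", "low", "med", "vhigh"]

# Counts copied from the first confusion matrix above
# (rows: actual class, columns: predicted class)
matrix = [
    [81, 27, 73, 164],
    [57, 33, 166, 74],
    [73, 59, 121, 92],
    [121, 22, 60, 139],
]

# Per-class error: misclassified rows divided by all rows of that actual class
for i, (label, row) in enumerate(zip(labels, matrix)):
    wrong = sum(row) - row[i]
    print(f"{label}: {wrong} / {sum(row)} = {wrong / sum(row):.7f}")

# Overall error and top-1 hit ratio from the diagonal
total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
overall_error = (total - correct) / total
top1_hit_ratio = correct / total
print(f"overall error: {overall_error:.7f}, top-1 hit ratio: {top1_hit_ratio:.7f}")
```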


In [153]:
# Get the best model from AutoML
best_model = automl.leader

# Predict on test set
test_preds = best_model.predict(test)

# Extract actual values for target variable in test set
y_test = test[y]
y_test_list = y_test.as_data_frame().iloc[:,0].tolist()

# Convert predicted values to a Pandas DataFrame and extract the predictions for target variable
y_test_hat = test_preds.as_data_frame()['predict']

accuracy = accuracy_score(y_test_list, y_test_hat)
print(f"AutoML Classifier Accuracy: {accuracy}")


stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%
AutoML Classifier Accuracy: 0.42349726775956287


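The accuracy rose from about 0.25 for the best baseline (SVM) to about 0.42 with AutoML, roughly 1.7 times higher. Note that the two test sets come from different random splits, so the comparison is approximate.
Next, we will run inference with the AutoML model and see whether the prediction is still the same.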

In [156]:
# Use the best model to make a prediction
new_df = pd.DataFrame(new_record, index=[0])
prediction = best_model.predict(h2o.H2OFrame(new_df)).as_data_frame()['predict'][0]
prediction

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%


'med'

It turns out that the AutoML model also predicted the buying price to be of the `med` class!