#Project Title: Regression of Used Car Prices" Predicting Car Prices Using Machine Learning Models"

Contributor: Rajeev Singh Sisodiya

#Project Overview:
The goal of this project is to predict the prices of used cars based on various car attributes using advanced regression techniques. This problem is part of the 2024 Kaggle Playground Series, where datasets are designed for practicing and sharpening machine learning skills. The dataset includes attributes such as engine type, cylinder count, horsepower, and other characteristics.

The solution involves:

Data preprocessing (handling missing values, feature extraction, and scaling).
Feature engineering (extracting details like engine displacement, cylinder configuration, and transmission type).
Implementing and comparing different machine learning

models, including:

XGBoost (XGBRegressor)

LightGBM (LGBMRegressor)

Voting Regressor (combining the predictions of XGBoost and LightGBM)

By employing these models, we aim to minimize prediction errors (measured by RMSE) and maximize prediction accuracy (measured by R² score). After model evaluation, predictions on test data are submitted in the form of a CSV file.

To integrate modern machine learning techniques like Artificial Neural Networks (ANN), Recurrent Neural Networks (RNN), Computer Vision, and Generative AI into the car price prediction project, we can enhance our pipeline with these techniques, focusing on different aspects of the data.

ANN for Tabular Data Prediction:
We can use a simple ANN model for structured data prediction tasks.

RNN for Time Series Features:
If any temporal data (like model_year) can be considered as a time series feature, we can use RNN to model dependencies over time.

Computer Vision for Image Processing (if applicable): If the dataset contains images (e.g., car pictures), we can use convolutional neural networks (CNN) for image analysis.

Generative AI for Data Augmentation:
We can employ generative models to create synthetic data for training if the dataset is small.

#Key Steps:
Feature Engineering:
Extracting numerical values (horsepower, cylinder count) from text-based columns (like engine).
Handling missing values and scaling the data appropriately.

Modeling Approach:
Training and evaluating advanced ensemble methods such as XGBoost and LightGBM.
Using a Voting Regressor to blend the predictions from both models to improve accuracy.

# Import Libraries

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import copy
import re
from datetime import date
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import IsolationForest, VotingRegressor
import xgboost as xgb
from lightgbm import LGBMRegressor
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout



In [7]:
# Check for GPU
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))


Num GPUs Available:  0


In [8]:
# Load Data
train = pd.read_csv('/content/train.csv')
test = pd.read_csv('/content/test.csv')


In [9]:
# Deep Copy
df_train = copy.deepcopy(train)
df_test = copy.deepcopy(test)


In [10]:
# Display object type columns and their values
for col in df_train.select_dtypes(include='object'):
    print(col)
    print(df_train[col].nunique())
    print(df_train[col].unique())
    print('-----------------------------------------')


brand
57
['MINI' 'Lincoln' 'Chevrolet' 'Genesis' 'Mercedes-Benz' 'Audi' 'Ford'
 'BMW' 'Tesla' 'Cadillac' 'Land' 'GMC' 'Toyota' 'Hyundai' 'Volvo'
 'Volkswagen' 'Buick' 'Rivian' 'RAM' 'Hummer' 'Alfa' 'INFINITI' 'Jeep'
 'Porsche' 'McLaren' 'Honda' 'Lexus' 'Dodge' 'Nissan' 'Jaguar' 'Acura'
 'Kia' 'Mitsubishi' 'Rolls-Royce' 'Maserati' 'Pontiac' 'Saturn' 'Bentley'
 'Mazda' 'Subaru' 'Ferrari' 'Aston' 'Lamborghini' 'Chrysler' 'Lucid'
 'Lotus' 'Scion' 'smart' 'Karma' 'Plymouth' 'Suzuki' 'FIAT' 'Saab'
 'Bugatti' 'Mercury' 'Polestar' 'Maybach']
-----------------------------------------
model
1897
['Cooper S Base' 'LS V8' 'Silverado 2500 LT' ... 'e-Golf SE'
 'Integra w/A-Spec Tech Package' 'IONIQ Plug-In Hybrid SEL']
-----------------------------------------
fuel_type
7
['Gasoline' 'E85 Flex Fuel' nan 'Hybrid' 'Diesel' 'Plug-In Hybrid' '–'
 'not supported']
-----------------------------------------
engine
1117
['172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel'
 '252.0HP 3.9L 8 Cylinder Engine Gasolin

In [11]:
# Extract engine features
def extract_engine_features(df):
    df['Horse_power'] = df['engine'].str.extract(r'(\d+)\.?\d*HP').astype(float)
    df['Engine_Displacement'] = df['engine'].str.extract(r'(\d+\.?\d*)L').astype(float)
    df['Cylinder_Count'] = df['engine'].str.extract(r'(\d+) Cylinder ').astype(float)
    return df

df_train = extract_engine_features(df_train)
df_test = extract_engine_features(df_test)


In [12]:
# Extract engine details (cylinder config, turbocharger, engine type, fuel systems)
cylinder_config = {'V', 'Flat', 'Straight'}
Turbocharger = {'Turbo', 'Twin Turbo'}
engine_type = {'Gasoline', 'Electric', 'Hybrid'}
FUEL_SYSTEMS = {'MPFI', 'GDI', 'PDI', 'TFSI', 'DOHC', 'SOHC'}

def extract_engine_components(df, component, keywords):
    df[component] = df['engine'].apply(lambda x: next((kw for kw in keywords if kw in x), 'Nan'))
    return df

df_train = extract_engine_components(df_train, 'cylinder_config', cylinder_config)
df_train = extract_engine_components(df_train, 'Turbocharger', Turbocharger)
df_train = extract_engine_components(df_train, 'engine_type', engine_type)
df_train = extract_engine_components(df_train, 'FUEL_SYSTEMS', FUEL_SYSTEMS)

df_test = extract_engine_components(df_test, 'cylinder_config', cylinder_config)
df_test = extract_engine_components(df_test, 'Turbocharger', Turbocharger)
df_test = extract_engine_components(df_test, 'engine_type', engine_type)
df_test = extract_engine_components(df_test, 'FUEL_SYSTEMS', FUEL_SYSTEMS)



In [13]:
# Transmission extraction
def extract_transmission(word):
    def get_number():
        n = re.findall('\d+', str(word))
        return n[0] if n else ''

    if any(i in str(word) for i in ['AT', 'A/T', 'At/Mt', 'Automatic']):
        return 'AT' + get_number()
    elif 'CVT' in str(word):
        return 'CVT' + get_number()
    elif any(i in str(word) for i in ['Manual', 'M/T']):
        return 'MT' + get_number()
    else:
        return 'other'

df_train['new_transmission'] = df_train['transmission'].apply(extract_transmission)
df_test['new_transmission'] = df_test['transmission'].apply(extract_transmission)



In [14]:
# Drop unnecessary columns
df_train.drop(['engine', 'transmission', 'model'], axis=1, inplace=True)
df_test.drop(['engine', 'transmission', 'model'], axis=1, inplace=True)


In [18]:
# Handling missing values
num_cols = df_train.select_dtypes(include='number').columns
cat_cols = df_train.select_dtypes(include='object').columns

num_imputer = SimpleImputer(strategy='mean')
cat_imputer = SimpleImputer(strategy='most_frequent')
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
scaler = StandardScaler()

# Check if 'price' column exists in df_test before applying transformations
if 'price' in df_test.columns:
    for col in num_cols:
        df_train[col] = num_imputer.fit_transform(df_train[[col]])
        df_test[col] = num_imputer.transform(df_test[[col]])
        df_train[col] = scaler.fit_transform(df_train[[col]])
        df_test[col] = scaler.transform(df_test[[col]])

for col in cat_cols:
    # Use single brackets and ravel to pass a 1D array to fit_transform
    df_train[col] = cat_imputer.fit_transform(df_train[col].values.reshape(-1, 1)).ravel()
    df_test[col] = cat_imputer.transform(df_test[col].values.reshape(-1, 1)).ravel()
    df_train[col] = encoder.fit_transform(df_train[col].values.reshape(-1, 1)).ravel()
    df_test[col] = encoder.transform(df_test[col].values.reshape(-1, 1)).ravel()

In [20]:
# Isolation Forest for outlier detection
isolation_forest = IsolationForest(contamination=0.024, random_state=42)

# Check for NaN values and handle them before fitting the model
if df_train.isnull().values.any():
  # Impute NaN values with the mean for numerical columns and most frequent for categorical columns
  df_train.fillna(df_train.mean(numeric_only=True), inplace=True)
  df_train.fillna(df_train.mode().iloc[0], inplace=True)

x_train_labels = isolation_forest.fit_predict(df_train)
normal_bool = x_train_labels != -1
df_train = df_train[normal_bool]

In [21]:
# Model Setup
x_train, x_test, y_train, y_test = train_test_split(df_train.drop('price', axis=1), df_train['price'], test_size=0.2, random_state=42)


In [22]:
# XGBoost and LightGBM model setup
xgb_model = xgb.XGBRegressor(tree_method="hist", device="cuda", objective="reg:squarederror", eval_metric="rmse",
                             random_state=42, colsample_bytree=0.45, learning_rate=0.025, max_depth=7, n_estimators=3000,
                             reg_alpha=0.001, reg_lambda=0.001, min_child_weight=18, verbosity=0, enable_categorical=True)

lgb_model = LGBMRegressor(n_epochs=1000, learning_rate=0.01, colsample_bytree=0.55, bagging_fraction=0.8, num_leaves=3072,
                          min_child_samples=12, reg_lambda=64, max_bin=255, max_depth=10, reg_alpha=0, verbose=-1)


In [23]:
# Train Models
xgb_model.fit(x_train, y_train)
lgb_model.fit(x_train, y_train)


In [24]:
# Voting Regressor
vtr = VotingRegressor(estimators=[('xgboost', xgb_model), ('lightgbm', lgb_model)])
vtr.fit(x_train, y_train)


In [25]:
# Predictions
y_pred_xgb = xgb_model.predict(x_test)
y_pred_lgb = lgb_model.predict(x_test)
y_pred_vtr = vtr.predict(x_test)

In [26]:
# Evaluation
mse_xgb = np.sqrt(mean_squared_error(y_test, y_pred_xgb))
mse_lgb = np.sqrt(mean_squared_error(y_test, y_pred_lgb))
mse_vtr = np.sqrt(mean_squared_error(y_test, y_pred_vtr))

In [27]:
print("XGBoost MSE:", mse_xgb)
print("LightGBM MSE:", mse_lgb)
print("Voting Regressor MSE:", mse_vtr)

XGBoost MSE: 55735.64240506938
LightGBM MSE: 56244.16193475547
Voting Regressor MSE: 55376.16028419031


In [29]:
# Submission File
submission = pd.DataFrame({
    'id': test['id'],
    'price': vtr.predict(df_test)
})
submission.to_csv('submission.csv', index=False)

Let's add an ANN model to your pipeline for used car price predictions, keeping the current XGBoost and LightGBM models for comparison.

Steps:

Prepare the data pipeline as you already did.

Add an ANN model for the prediction task.

Compare ANN performance with XGBoost and LightGBM.

In [30]:
# ANN Model
def create_ann_model():
    model = Sequential([
        Dense(128, activation='relu', input_shape=(x_train.shape[1],)),
        Dropout(0.2),
        Dense(64, activation='relu'),
        Dropout(0.2),
        Dense(32, activation='relu'),
        Dense(1)
    ])
    model.compile(optimizer='adam', loss='mse', metrics=['mae'])
    return model

ann_model = create_ann_model()
ann_model.fit(scaler.fit_transform(x_train), y_train, epochs=50, batch_size=32, validation_split=0.2)

# ANN Predictions
y_pred_ann = ann_model.predict(scaler.transform(x_test))
mse_ann = mean_squared_error(y_test, y_pred_ann)
r2_ann = r2_score(y_test, y_pred_ann)

print("ANN MSE:", mse_ann)
print("ANN R2 Score:", r2_ann)


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


Epoch 1/50
[1m3681/3681[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m15s[0m 3ms/step - loss: 4071098624.0000 - mae: 24393.2832 - val_loss: 2965001728.0000 - val_mae: 17404.6270
Epoch 2/50
[1m3681/3681[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m20s[0m 3ms/step - loss: 2788294400.0000 - mae: 17484.9844 - val_loss: 2956114944.0000 - val_mae: 16880.9043
Epoch 3/50
[1m3681/3681[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m35s[0m 7ms/step - loss: 2588393216.0000 - mae: 17379.8945 - val_loss: 2949027072.0000 - val_mae: 16856.8203
Epoch 4/50
[1m3681/3681[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m26s[0m 3ms/step - loss: 2862849280.0000 - mae: 17539.0000 - val_loss: 2943825920.0000 - val_mae: 16835.7500
Epoch 5/50
[1m3681/3681[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m12s[0m 3ms/step - loss: 2777673472.0000 - mae: 17425.4785 - val_loss: 2939969792.0000 - val_mae: 16866.0664
Epoch 6/50
[1m3681/3681[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m21s[0m 3ms/step - loss


Model Performance:

The MSE is quite high, indicating that the model's predictions are significantly different from the actual values.

The R2 score of 0.162 suggests that the model is only explaining about 16% of the variance in the data. This is a low score, meaning the model is not capturing the underlying patterns in the dataset effectively.

Implications:

A high MSE combined with a low R2 score points to poor model fit, implying that the ANN is not making accurate predictions and might not generalize well to unseen data.


In [33]:
# Import the necessary module
!pip install scikeras


Collecting scikeras
  Downloading scikeras-0.13.0-py3-none-any.whl.metadata (3.1 kB)
Collecting scikit-learn>=1.4.2 (from scikeras)
  Downloading scikit_learn-1.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading scikeras-0.13.0-py3-none-any.whl (26 kB)
Downloading scikit_learn-1.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.4 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.4/13.4 MB[0m [31m59.4 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: scikit-learn, scikeras
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.3.2
    Uninstalling scikit-learn-1.3.2:
      Successfully uninstalled scikit-learn-1.3.2
Successfully installed scikeras-0.13.0 scikit-learn-1.5.1


In [35]:
# Import the necessary module
from scikeras.wrappers import KerasRegressor # Use Scikeras instead of tensorflow.keras.wrappers.scikit_learn

# Voting Regressor with ANN
# Wrap ann_model with KerasRegressor
vtr_ann = VotingRegressor(estimators=[('xgboost', xgb_model), ('lightgbm', lgb_model), ('ann', KerasRegressor(build_fn=create_ann_model))])
vtr_ann.fit(scaler.fit_transform(x_train), y_train)
y_pred_vtr_ann = vtr_ann.predict(scaler.transform(x_test))

  X, y = self._initialize(X, y)
  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


[1m4601/4601[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m20s[0m 4ms/step - loss: 3201067520.0000 - mae: 22739.7461
[1m1151/1151[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 2ms/step


In [36]:
# Evaluation with ANN
mse_vtr_ann = mean_squared_error(y_test, y_pred_vtr_ann)
print("Voting Regressor with ANN MSE:", mse_vtr_ann)

Voting Regressor with ANN MSE: 3062428597.8454475


The result of the Voting Regressor with ANN yielding an MSE (Mean Squared Error) of 3,062,428,597.85 suggests that the model's predictions are still quite far from the actual values. This high MSE indicates that the model may not be performing well in capturing the relationship between the input features and the target variable (price in this case).

#Conclusion:
Model Performance: The high MSE signifies that the ensemble model, even with the inclusion of ANN, is not accurate. There may be a need for further optimization in terms of feature engineering, model tuning, or data preprocessing.
