Rusty Bargain, a used car sales service, is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model that determines the value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

## Data preparation

In [1]:
import time

import lightgbm as lgb
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

In [2]:
full_df = pd.read_csv('/datasets/car_data.csv')

In [3]:
full_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(9)

In [4]:
# Sample 1000 rows from the DataFrame (no random_state is set, so the sample and every metric below vary between runs)
df = full_df.sample(n=1000)

In [5]:
# Separate features and target
X = df.drop(columns=['Price', 'DateCrawled', 'DateCreated', 'LastSeen'])
y = df['Price']

In [6]:
# Handling missing values
cat_imputer = SimpleImputer(strategy='most_frequent')
num_imputer = SimpleImputer(strategy='median')

In [7]:
# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object']).columns
numerical_cols = X.select_dtypes(exclude=['object']).columns

In [8]:
# Impute missing values
X[categorical_cols] = cat_imputer.fit_transform(X[categorical_cols])
X[numerical_cols] = num_imputer.fit_transform(X[numerical_cols])

In [9]:
# Convert categorical variables to numerical using One-Hot Encoding
X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)

## Model training

In [10]:
# Split the data into training, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
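
As a quick check on the split above, the following sketch works out how many rows land in each set for the 1,000-row sample. The helper name `split_sizes` is hypothetical, and the rounding rule (the test share is rounded up, the remainder goes to the other part) is an assumption about how `train_test_split` handles a fractional `test_size`.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_first_part, n_second_part) for a fractional test_size.

    Assumes the second part is rounded up and the first part gets the rest.
    """
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second


# First split: 80% training, 20% held out.
n_train, n_temp = split_sizes(1000, 0.2)
# Second split: the held-out part is halved into validation and test.
n_val, n_test = split_sizes(n_temp, 0.5)

print(n_train, n_val, n_test)  # 800 100 100
```

With only 100 rows each in the validation and test sets, a handful of badly predicted cars can move the RMSE noticeably.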

In [11]:
# Linear Regression Model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
lr_preds = lr_model.predict(X_val)
lr_rmse = mean_squared_error(y_val, lr_preds, squared=False)
print("Linear Regression RMSE:", lr_rmse)

Linear Regression RMSE: 307685.5821815755


The Linear Regression model is fit on the training data, makes predictions on the validation data, and is scored with the Root Mean Squared Error (RMSE) between the actual and predicted values. In the run shown, the RMSE is approximately 307,685.58. That is two orders of magnitude above the tree-based models below, which indicates that at least some of the linear model's predictions are far off.
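
To make the metric concrete, here is a minimal sketch of the RMSE computed by hand with the standard library only. The function name `rmse` and the sample prices are hypothetical and are not taken from the dataset. In recent scikit-learn releases the `squared=False` argument is deprecated in favour of a dedicated `root_mean_squared_error` function, so a hand-rolled version like this can also serve as a cross-check.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical prices, in the same units as the Price column.
actual = [1000, 2000, 3000, 4000]
predicted = [1100, 1900, 3200, 3600]

# Errors are 100, 100, 200 and 400; the mean squared error is 55000.
print(round(rmse(actual, predicted), 2))  # 234.52
```

Because the errors are squared before averaging, a single very large miss can dominate the result, which is one way a linear model can end up with an RMSE far above the others.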

In [12]:
# Random Forest Model
rf_model = RandomForestRegressor()
rf_model.fit(X_train, y_train)
rf_preds = rf_model.predict(X_val)
rf_rmse = mean_squared_error(y_val, rf_preds, squared=False)
print("Random Forest RMSE:", rf_rmse)

Random Forest RMSE: 2739.3877103811815


The RMSE between the actual validation targets (`y_val`) and the Random Forest predictions (`rf_preds`) is approximately 2739.39 in the run shown. RMSE is expressed in the same units as `Price`, so it measures how far the predictions typically fall from the actual values.

In [13]:
# LightGBM Model
lgb_model = lgb.LGBMRegressor()
lgb_model.fit(X_train, y_train)
lgb_preds = lgb_model.predict(X_val)
lgb_rmse = mean_squared_error(y_val, lgb_preds, squared=False)
print("LightGBM RMSE:", lgb_rmse)

LightGBM RMSE: 2433.813753869468


The RMSE between the actual values in `y_val` and the LightGBM predictions is approximately 2433.81 in the run shown, the lowest validation error of the three models.

## Model analysis

In [14]:
# measure time for Linear Regression Model
lr_model = LinearRegression()

start_time = time.time()
lr_model.fit(X_train, y_train)
end_time = time.time()

lr_training_time = end_time - start_time

start_time = time.time()
lr_preds = lr_model.predict(X_test)
end_time = time.time()

lr_prediction_time = end_time - start_time

print("Linear Regression Training Time:", lr_training_time, "seconds")
print("Linear Regression Prediction Time:", lr_prediction_time, "seconds")

lr_rmse = mean_squared_error(y_test, lr_preds, squared=False)
print("Linear Regression RMSE:", lr_rmse)

Linear Regression Training Time: 0.19422245025634766 seconds
Linear Regression Prediction Time: 0.0974278450012207 seconds
Linear Regression RMSE: 78074.43608868262


This cell measures the time taken to train a Linear Regression model and to make predictions with it. It initializes the model, fits it to the training data, predicts on the test set, and times the two steps separately. In the run shown, training takes approximately 0.194 seconds and prediction approximately 0.097 seconds. The RMSE between the actual and predicted test values is approximately 78,074.44.

In [15]:
# measure time for Random Forest Model
rf_model = RandomForestRegressor()

start_time = time.time()
rf_model.fit(X_train, y_train)
end_time = time.time()

rf_training_time = end_time - start_time

start_time = time.time()
rf_preds = rf_model.predict(X_test)
end_time = time.time()

rf_prediction_time = end_time - start_time

print("Random Forest Training Time:", rf_training_time, "seconds")
print("Random Forest Prediction Time:", rf_prediction_time, "seconds")

rf_rmse = mean_squared_error(y_test, rf_preds, squared=False)
print("Random Forest RMSE:", rf_rmse)

Random Forest Training Time: 0.5150580406188965 seconds
Random Forest Prediction Time: 0.006109714508056641 seconds
Random Forest RMSE: 2108.4912699195884


This cell measures the time taken to train a Random Forest model and to make predictions with it. It initializes a `RandomForestRegressor`, fits it to the training data, predicts on the test set, and times the two steps separately. In the run shown, training takes approximately 0.515 seconds, prediction approximately 0.006 seconds, and the test RMSE is approximately 2108.49.

In [16]:
# measure time for LightGBM Model
lgb_model = lgb.LGBMRegressor()

start_time = time.time()
lgb_model.fit(X_train, y_train)
end_time = time.time()

lgb_training_time = end_time - start_time

start_time = time.time()
lgb_preds = lgb_model.predict(X_test)
end_time = time.time()

lgb_prediction_time = end_time - start_time

print("LightGBM Training Time:", lgb_training_time, "seconds")
print("LightGBM Prediction Time:", lgb_prediction_time, "seconds")

lgb_rmse = mean_squared_error(y_test, lgb_preds, squared=False)
print("LightGBM RMSE:", lgb_rmse)

LightGBM Training Time: 0.9466733932495117 seconds
LightGBM Prediction Time: 0.009006261825561523 seconds
LightGBM RMSE: 2105.758058888348


The LightGBM model is trained on `X_train` and `y_train` and then used to make predictions on `X_test`. The RMSE is calculated with the `mean_squared_error` function, and the training time, prediction time, and RMSE are printed. In the run shown, the LightGBM model takes approximately 0.947 seconds to train and 0.009 seconds to make predictions, and achieves an RMSE of 2105.76.
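
The three timing cells above repeat the same pattern, so it could be factored into one helper. The sketch below is one way to do that, not the notebook's own code: `time_fit_predict` and the stand-in `MeanRegressor` are hypothetical names, and `time.perf_counter` is swapped in for `time.time` because it is intended for measuring short intervals.

```python
import time


def time_fit_predict(model, X_train, y_train, X_eval):
    """Fit and predict, returning (predictions, fit_seconds, predict_seconds)."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_eval)
    predict_seconds = time.perf_counter() - start

    return predictions, fit_seconds, predict_seconds


class MeanRegressor:
    """Hypothetical stand-in model: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


preds, fit_s, predict_s = time_fit_predict(
    MeanRegressor(), [[1], [2], [3]], [10, 20, 30], [[4], [5]]
)
print(preds)  # [20.0, 20.0]
print(f"fit: {fit_s:.6f} s, predict: {predict_s:.6f} s")
```

Any object with `fit` and `predict` methods, such as `LinearRegression()`, `RandomForestRegressor()`, or `lgb.LGBMRegressor()`, could be passed in place of the stand-in. Single measurements of a few milliseconds are noisy, so repeating the call and averaging gives steadier numbers.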

In [18]:
# Hyperparameter tuning for Random Forest
rf_params = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [1.0, 'sqrt', 'log2']  # 1.0 (all features) replaces 'auto', which newer scikit-learn versions reject
}
rf_random_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(),
    param_distributions=rf_params,
    n_iter=10,
    scoring='neg_root_mean_squared_error',
    random_state=42,
    n_jobs=-1,
    cv=3
)

start_time = time.time()
rf_random_search.fit(X_train, y_train)
end_time = time.time()
rf_random_search_training_time = end_time - start_time

best_rf_model = rf_random_search.best_estimator_
start_time = time.time()
best_rf_model.fit(X_train, y_train)
end_time = time.time()
rf_model_training_time = end_time - start_time

start_time = time.time()
rf_preds = best_rf_model.predict(X_test)
end_time = time.time()
rf_prediction_time = end_time - start_time

rf_rmse = mean_squared_error(y_test, rf_preds, squared=False)
print("Best Random Forest RMSE:", rf_rmse)
print("Random Forest RandomizedSearchCV Training Time:", rf_random_search_training_time, "seconds")
print("Best Random Forest Model Training Time:", rf_model_training_time, "seconds")
print("Random Forest Model Prediction Time:", rf_prediction_time, "seconds")

Best Random Forest RMSE: 2204.2522047545294
Random Forest RandomizedSearchCV Training Time: 5.748661756515503 seconds
Best Random Forest Model Training Time: 0.3197293281555176 seconds
Random Forest Model Prediction Time: 0.005269765853881836 seconds


Hyperparameter tuning for Random Forest with RandomizedSearchCV produced a best model with a test RMSE of 2204.25. The search itself took approximately 5.75 seconds, refitting the best model took about 0.32 seconds, and prediction took about 0.005 seconds. In this run the tuned model's RMSE is higher than the 2108.49 obtained by the default Random Forest on the same test set, so the search did not improve accuracy here. With only 10 sampled configurations and a 100-row test set, a difference of this size may be noise.
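
To put the search budget in perspective, the sketch below counts how many combinations the parameter grid contains and how many model fits the randomized search performs. The variable names are hypothetical, and the count assumes the default `refit=True`, which adds one final fit on the full training set.

```python
import math

# Number of candidate values per hyperparameter in the search space above.
values_per_param = {
    'n_estimators': 3,
    'max_depth': 4,
    'min_samples_split': 3,
    'min_samples_leaf': 3,
    'max_features': 3,
}
n_iter = 10  # configurations sampled by the randomized search
cv = 3       # cross-validation folds per configuration

grid_size = math.prod(values_per_param.values())
search_fits = n_iter * cv
total_fits = search_fits + 1  # plus the final refit (assumes refit=True)
coverage = n_iter / grid_size

print(grid_size)           # 324
print(search_fits)         # 30
print(total_fits)          # 31
print(f"{coverage:.1%}")   # 3.1%
```

The search therefore samples about 3% of the grid. Raising `n_iter` explores more of it at a roughly proportional cost in search time.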

## Conclusion

In conclusion, three models were compared for Rusty Bargain's car valuation app on prediction quality, prediction speed, and training time, using a 1,000-row sample of the historical data. On the test set, LightGBM (RMSE 2105.76) and Random Forest (RMSE 2108.49) were nearly tied, while Linear Regression was far behind (RMSE 78,074.44). Linear Regression trained fastest (about 0.19 seconds) but was the slowest to predict in this run (about 0.10 seconds). Random Forest trained in about 0.52 seconds and LightGBM in about 0.95 seconds, and both predicted in under 0.01 seconds. Hyperparameter tuning of the Random Forest with RandomizedSearchCV did not improve the test RMSE in this run (2204.25). On these results, the tree-based models are the stronger candidates for the app, with the figures being indicative rather than final because they come from a small sample.