# Rusty Bargain Predictive Model

Rusty Bargain, a used car sales service, is developing an app to attract new customers. The app lets users quickly find out the market value of their car. Using historical data (technical specifications, trim versions, and prices), we build a model that predicts that value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

## Import Libraries 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor

import time

## Data preparation

In [2]:
# Load the dataset
df = pd.read_csv('/datasets/car_data.csv')

In [3]:
# Display the first few rows
df.head()


Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


We rename the columns to snake_case so that the code follows a consistent naming style.

In [4]:
df = df.rename(columns={
    'DateCrawled': 'date_crawled',
    'Price': 'price',
    'VehicleType': 'vehicle_type',
    'RegistrationYear': 'registration_year',
    'Gearbox': 'gearbox',
    'Power': 'power',
    'Model': 'model',
    'Mileage': 'mileage',
    'RegistrationMonth': 'registration_month',
    'FuelType': 'fuel_type',
    'Brand': 'brand',
    'NotRepaired': 'not_repaired',
    'DateCreated': 'date_created',
    'NumberOfPictures': 'number_of_pictures',
    'PostalCode': 'postal_code',
    'LastSeen': 'last_seen',
})

In [5]:
# Summary of the dataset
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   date_crawled        354369 non-null  object
 1   price               354369 non-null  int64 
 2   vehicle_type        316879 non-null  object
 3   registration_year   354369 non-null  int64 
 4   gearbox             334536 non-null  object
 5   power               354369 non-null  int64 
 6   model               334664 non-null  object
 7   mileage             354369 non-null  int64 
 8   registration_month  354369 non-null  int64 
 9   fuel_type           321474 non-null  object
 10  brand               354369 non-null  object
 11  not_repaired        283215 non-null  object
 12  date_created        354369 non-null  object
 13  number_of_pictures  354369 non-null  int64 
 14  postal_code         354369 non-null  int64 
 15  last_seen           354369 non-null  object
dtypes: int64(7), object(9)

In [6]:
# Check for missing values
missing_values = df.isnull().sum()

# Check for duplicate rows
duplicate_rows = df.duplicated().sum()

# Check for extreme values using summary statistics
summary_stats = df.describe()

missing_values, duplicate_rows, summary_stats

(date_crawled              0
 price                     0
 vehicle_type          37490
 registration_year         0
 gearbox               19833
 power                     0
 model                 19705
 mileage                   0
 registration_month        0
 fuel_type             32895
 brand                     0
 not_repaired          71154
 date_created              0
 number_of_pictures        0
 postal_code               0
 last_seen                 0
 dtype: int64,
 262,
                price  registration_year          power        mileage  \
 count  354369.000000      354369.000000  354369.000000  354369.000000   
 mean     4416.656776        2004.234448     110.094337  128211.172535   
 std      4514.158514          90.227958     189.850405   37905.341530   
 min         0.000000        1000.000000       0.000000    5000.000000   
 25%      1050.000000        1999.000000      69.000000  125000.000000   
 50%      2700.000000        2003.000000     105.000000  150000.000000 

<b>Missing Values:</b>
The missing value analysis shows that some features have a significant number of missing values:

- Vehicle Type: 37,490 missing entries.
- Gearbox: 19,833 missing entries.
- Model: 19,705 missing entries.
- Fuel Type: 32,895 missing entries.
- Not Repaired: 71,154 missing entries.

<b>Duplicates:</b>
There are 262 duplicate rows in the dataset. Duplicate rows could skew the model by over-representing certain vehicles. Removing them will ensure that the model's predictions are based on diverse data points without redundancy.

<b>Descriptive Statistics:</b>
The summary statistics for key features reveal the following:

Price:

- The minimum value is 0, which is not realistic for a car. These entries likely represent errors or outliers that need to be removed.
- The maximum price is 20,000 Euros, which seems plausible for high-end or relatively new used cars.
- Median price is 2,700 Euros, and the mean is 4,416 Euros, showing that prices are skewed towards the lower end.

Power:

- Power values range from 0 to 20,000 horsepower (hp), with an average of 110 hp.
- Extremely high power values (e.g., 20,000 hp) are likely erroneous and need to be handled by capping or removing outliers.
- The median power is 105 hp, aligning with realistic car power for common models.

Registration Year:

- The minimum registration year is 1000, which is clearly incorrect and unrealistic.
- The maximum registration year is 9999, another unrealistic value.
- A realistic range would likely be between 1950 and the current year, so filtering the data based on this range would help clean up extreme values.

Mileage:

- The mileage ranges from 5,000 km to 150,000 km, which seems reasonable for used cars.
- The mean mileage is around 128,000 km, with a median value of 150,000 km, suggesting most cars have substantial usage.

Registration Month:

- The RegistrationMonth column is not limited to the valid months 1 to 12: the first row of the preview above has a value of 0, which most likely marks an unknown month.

Number of Pictures:

- The NumberOfPictures column contains only zeros. Since it has no variance, it does not provide useful information for the model and should be dropped from the dataset.
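
The registration-year range suggested above could be applied with a simple boolean mask. The sketch below is not part of the original notebook: it uses a small made-up frame, and the bounds of 1950 and 2016 are assumptions (2016 is taken from the crawl dates in the preview). It also flags a registration month of 0, as seen in the first preview row, instead of dropping the row.

```python
import pandas as pd

# Toy data (hypothetical values) mimicking the problematic columns
toy = pd.DataFrame({
    "registration_year": [1000, 1993, 2011, 9999, 2004],
    "registration_month": [0, 5, 8, 6, 0],
    "price": [480, 18300, 9800, 1500, 3600],
})

# Assumed plausible bounds: 1950 up to the crawl year (2016)
MIN_YEAR, MAX_YEAR = 1950, 2016

# Series.between is inclusive on both ends by default
year_ok = toy["registration_year"].between(MIN_YEAR, MAX_YEAR)
filtered = toy[year_ok].reset_index(drop=True)

# Keep rows with month 0, but record that the month is unknown
filtered["month_known"] = filtered["registration_month"] != 0

print(filtered)
```

Whether to drop, cap, or flag such rows is a judgment call; flagging keeps more data for training at the cost of one extra feature.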

### Data Cleaning and Feature Engineering

In [7]:
# Drop irrelevant columns
df = df.drop(columns=['date_crawled', 'date_created', 'number_of_pictures', 'postal_code', 'last_seen'])

# Drop duplicate rows
df = df.drop_duplicates()

# Handle missing values
df['vehicle_type'] = df['vehicle_type'].fillna('unknown')
df['gearbox'] = df['gearbox'].fillna('unknown')
df['fuel_type'] = df['fuel_type'].fillna('unknown')
df['not_repaired'] = df['not_repaired'].fillna('unknown')
df['model'] = df['model'].fillna('unknown')


In [8]:
# Drop rows with missing target values
df = df.dropna(subset=['price'])

# Remove unrealistic data: prices of 100 or less (or 100,000 and above) and power of 0 or 1,000 hp and above
df = df[(df['price'] > 100) & (df['price'] < 100000)]
df = df[(df['power'] > 0) & (df['power'] < 1000)]

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 281491 entries, 1 to 354368
Data columns (total 11 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   price               281491 non-null  int64 
 1   vehicle_type        281491 non-null  object
 2   registration_year   281491 non-null  int64 
 3   gearbox             281491 non-null  object
 4   power               281491 non-null  int64 
 5   model               281491 non-null  object
 6   mileage             281491 non-null  int64 
 7   registration_month  281491 non-null  int64 
 8   fuel_type           281491 non-null  object
 9   brand               281491 non-null  object
 10  not_repaired        281491 non-null  object
dtypes: int64(5), object(6)
memory usage: 25.8+ MB


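- We removed irrelevant columns, dropped duplicate rows, and filled missing values in the categorical features with 'unknown'.

- Additionally, we removed unrealistic entries: listings priced at 100 Euros or less, and cars with 0 hp or with 1,000 hp or more. The price column has no missing values, so dropping rows with a missing target removes nothing. The out-of-range registration years noted earlier are not filtered in this step.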

## Model training

### Feature Encoding and Splitting Data
Next, we one-hot encode the categorical features and split the data into training and testing sets.

In [9]:
# Identify categorical and numerical features
categorical_features = ['vehicle_type', 'gearbox', 'model', 'fuel_type', 'brand', 'not_repaired']
numerical_features = ['registration_year', 'power', 'mileage', 'registration_month']

# One-hot encode categorical features
df_encoded = pd.get_dummies(df, columns=categorical_features, drop_first=True)

# Split the data into features (X) and target (y)
X = df_encoded.drop(columns=['price'])
y = df_encoded['price']

# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Show the shapes of the split data
print(X_train.shape, X_test.shape)

(225192, 311) (56299, 311)


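We split the dataset into training (80%) and testing (20%) sets so that the models can be trained on one part and evaluated on data they have not seen. After one-hot encoding, each set has 311 feature columns.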
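
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up frame (the values are illustrative, not taken from the dataset). With three gearbox categories, only two indicator columns are created; the dropped category is represented by both indicators being zero.

```python
import pandas as pd

# Toy frame (hypothetical values) with one categorical and one numerical column
toy = pd.DataFrame({
    "gearbox": ["manual", "auto", "unknown", "manual"],
    "power": [75, 163, 69, 190],
})

# Categories are sorted alphabetically, so 'auto' is the one that gets dropped
encoded = pd.get_dummies(toy, columns=["gearbox"], drop_first=True)

print(encoded.columns.tolist())
# → ['power', 'gearbox_manual', 'gearbox_unknown']
```

Dropping one level per feature avoids perfectly collinear columns, which matters mostly for the linear model; tree-based models are not affected by it.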

### Scaling Data
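We'll also scale numerical features for models that are sensitive to feature scaling. Note that the models below are fit on the unscaled `X_train`; ordinary least squares and tree-based models are largely insensitive to feature scaling, so the scaled copies are prepared here but not required by them.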

In [10]:

# Initialize the scaler
scaler = StandardScaler()

# Scale the numerical features for the training set
X_train_scaled = X_train.copy()
X_train_scaled[numerical_features] = scaler.fit_transform(X_train[numerical_features])

# Scale the numerical features for the testing set (reusing the training statistics)
X_test_scaled = X_test.copy()
X_test_scaled[numerical_features] = scaler.transform(X_test[numerical_features])

# Print the original (unscaled) features for reference;
# inspect X_train_scaled and X_test_scaled to verify the scaling itself
print(X_train[numerical_features].head())
print(X_test[numerical_features].head())

        registration_year  power  mileage  registration_month
110973               2003    163   150000                   4
145369               2002     58   150000                   2
34131                2005    140   150000                  10
171827               2005    101   150000                  10
46345                2003    150   150000                  12
        registration_year  power  mileage  registration_month
289599               2008    170   150000                   5
310347               1993    150   150000                   7
205383               2006    200   150000                   3
28743                1998    101   150000                   7
256769               2012     60    50000                   2
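
One way to confirm that scaling worked is to check the statistics of the scaled training columns rather than printing the raw values. The following is a sketch on made-up numbers, not the notebook's data: after `fit_transform`, each training column should have mean close to 0 and population standard deviation close to 1, while the test set is transformed with the training statistics.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy train/test frames (hypothetical values)
train = pd.DataFrame({
    "power": [163, 58, 140, 101, 150],
    "mileage": [150000, 150000, 125000, 90000, 50000],
})
test = pd.DataFrame({"power": [170, 60], "mileage": [150000, 50000]})

scaler = StandardScaler()

# Fit on the training data only, then reuse the fitted statistics for the test data
train_scaled = pd.DataFrame(
    scaler.fit_transform(train), columns=train.columns, index=train.index
)
test_scaled = pd.DataFrame(
    scaler.transform(test), columns=test.columns, index=test.index
)

# StandardScaler uses the population standard deviation, hence ddof=0
print(train_scaled.mean().round(6))
print(train_scaled.std(ddof=0).round(6))
```

The test columns will generally not have mean 0 and standard deviation 1, which is expected because they are scaled with the training set's statistics.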


### Model Training and Evaluation
We'll now train and evaluate different models. The models we'll use are:

- Linear Regression (as a baseline)
- Decision Tree
- Random Forest
- LightGBM

#### Linear Regression 

In [11]:
%%time
# Initialize and train the Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Make predictions
y_pred_lr = lr_model.predict(X_test)

# Calculate RMSE for Linear Regression
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))

print(f"Linear Regression RMSE: {rmse_lr}")

Linear Regression RMSE: 2843.142802791763
CPU times: user 7.26 s, sys: 1.21 s, total: 8.47 s
Wall time: 8.47 s


The high RMSE (2843.14) indicates that Linear Regression struggles to accurately capture the complexity of the data. 

This model underperforms compared to the others, reinforcing the need for more sophisticated algorithms for this prediction task.
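
For reference, RMSE is the square root of the mean squared difference between actual and predicted prices, so it is expressed in the same unit as the target (Euros). A minimal sketch with made-up prices shows that the manual formula matches the `np.sqrt(mean_squared_error(...))` pattern used throughout this notebook:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted prices, for illustration only
y_true = np.array([480.0, 18300.0, 9800.0, 1500.0])
y_pred = np.array([1000.0, 17000.0, 9000.0, 2500.0])

# RMSE written out: sqrt(mean((y - y_hat)^2))
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn's mean squared error
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones.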

#### Decision Tree

In [12]:
%%time
# Initialize and train the Decision Tree model
dt_model = DecisionTreeRegressor(max_depth=10, random_state=42)
dt_model.fit(X_train, y_train)

# Make predictions
y_pred_dt = dt_model.predict(X_test)

# Calculate RMSE for Decision Tree
rmse_dt = np.sqrt(mean_squared_error(y_test, y_pred_dt))

print(f"Decision Tree RMSE: {rmse_dt}")

Decision Tree RMSE: 1980.0134983936096
CPU times: user 2.55 s, sys: 149 ms, total: 2.7 s
Wall time: 2.75 s


The RMSE reduction from 2843 (Linear Regression) to 1980 indicates that the Decision Tree can better account for interactions between features.

However, this model is prone to overfitting, and limiting the tree depth (set to 10 in this case) helps control the model’s complexity and avoid overfitting.
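
The link between tree depth and overfitting can be illustrated on synthetic data. The sketch below is an assumption-laden toy example (a noisy sine curve, not the car data): an unrestricted tree fits the training set almost perfectly, while its test error stays well above its training error.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a sine curve plus Gaussian noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

results = {}
for depth in (2, 5, None):  # None lets the tree grow until leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    train_rmse = np.sqrt(mean_squared_error(y_tr, tree.predict(X_tr)))
    test_rmse = np.sqrt(mean_squared_error(y_te, tree.predict(X_te)))
    results[depth] = (train_rmse, test_rmse)
    print(f"max_depth={depth}: train RMSE={train_rmse:.3f}, test RMSE={test_rmse:.3f}")
```

A growing gap between training and test RMSE as depth increases is the typical sign that the tree is memorizing noise rather than learning structure.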

#### Random Forest Regressor

In [13]:
%%time
# Initialize and train the Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test)

# Calculate RMSE for Random Forest
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))

print(f"Random Forest RMSE: {rmse_rf}")

Random Forest RMSE: 1893.882510031551
CPU times: user 2min 42s, sys: 82.1 ms, total: 2min 42s
Wall time: 2min 42s


The RMSE decreases further to 1893.88, showing that Random Forest captures more patterns in the data than the Decision Tree.

This model is more robust and generalizes better due to its ensemble nature, which makes it less prone to overfitting compared to a single Decision Tree.
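
The "ensemble nature" mentioned above can be made concrete: a scikit-learn random forest regressor predicts by averaging the predictions of its individual trees. The sketch below checks this on synthetic data (the dataset and hyperparameters are arbitrary choices for illustration).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem, for illustration only
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

forest = RandomForestRegressor(n_estimators=20, max_depth=5, random_state=42)
forest.fit(X, y)

# Collect each tree's predictions and average them by hand
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction matches the manual average
print(np.allclose(manual_average, forest.predict(X)))
```

Averaging many trees trained on different bootstrap samples reduces variance, which is also why training takes far longer than for a single tree.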

#### LightGBM Regressor

In [14]:
%%time
# Initialize and train the LightGBM model
lgb_model = LGBMRegressor(n_estimators=100, max_depth=10, random_state=42)
lgb_model.fit(X_train, y_train)

# Make predictions
y_pred_lgb = lgb_model.predict(X_test)

# Calculate RMSE for LightGBM
rmse_lgb = np.sqrt(mean_squared_error(y_test, y_pred_lgb))

print(f"LightGBM RMSE: {rmse_lgb}")


LightGBM RMSE: 1699.7330752286382
CPU times: user 4.05 s, sys: 150 ms, total: 4.2 s
Wall time: 4.17 s


LightGBM clearly outperforms the other models, achieving the lowest RMSE of 1699.73 while training in about 4 seconds.
It is optimized for speed and efficiency while also being able to handle large datasets and complex relationships between features.

## Hyperparameter Tuning

In [15]:
%%time
# Define hyperparameters to tune
dt_params = {
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10]
}

# Initialize Decision Tree Regressor
dt_model = DecisionTreeRegressor(random_state=42)

# Initialize GridSearchCV for Decision Tree
dt_grid = GridSearchCV(estimator=dt_model, param_grid=dt_params, cv=3, scoring='neg_mean_squared_error')

# Fit the grid search model
dt_grid.fit(X_train, y_train)

# Best parameters and score
print(f"Best parameters for Decision Tree: {dt_grid.best_params_}")
print(f"Best RMSE for Decision Tree: {np.sqrt(-dt_grid.best_score_)}")

Best parameters for Decision Tree: {'max_depth': 15, 'min_samples_leaf': 5, 'min_samples_split': 20}
Best RMSE for Decision Tree: 1860.679519261277
CPU times: user 2min 25s, sys: 8.07 s, total: 2min 33s
Wall time: 2min 33s


By optimizing the hyperparameters, the Decision Tree Regressor improved from the initial RMSE of 1980.01 to a cross-validated RMSE of 1860.68.

The model now performs better than the untuned Random Forest (1893.88), but it still trails the untuned LightGBM model (1699.73).
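
The `np.sqrt(-best_score_)` pattern used in the tuning cells follows from how scikit-learn scorers work: higher scores are always better, so mean squared error is reported as a negative number. A minimal sketch on synthetic data (the grid and dataset are arbitrary assumptions) shows the sign convention:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem, for illustration only
X, y = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=42)

grid = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=42),
    param_grid={"max_depth": [3, 5, 10]},
    cv=3,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)

# best_score_ is the mean negative MSE over the validation folds
print(grid.best_score_)

# Flip the sign and take the square root to express it as an RMSE
cv_rmse = np.sqrt(-grid.best_score_)
print(grid.best_params_, cv_rmse)
```

This value is computed on held-out folds of the training data, so it is a cross-validated estimate and not directly the same quantity as an RMSE measured on a separate test set.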

In [16]:
%%time
# Define hyperparameters to tune
rf_params = {
    'n_estimators': [10, 20, 50],
    'max_depth': [10, 15, 20],
    'min_samples_split': [2, 10, 20]
}

# Initialize Random Forest Regressor
rf_model = RandomForestRegressor(random_state=42)

# Initialize GridSearchCV for Random Forest
rf_grid = GridSearchCV(estimator=rf_model, param_grid=rf_params, cv=3, scoring='neg_mean_squared_error')

# Fit the grid search model
rf_grid.fit(X_train, y_train)

# Best parameters and score
print(f"Best parameters for Random Forest: {rf_grid.best_params_}")
print(f"Best RMSE for Random Forest: {np.sqrt(-rf_grid.best_score_)}")

Best parameters for Random Forest: {'max_depth': 20, 'min_samples_split': 10, 'n_estimators': 50}
Best RMSE for Random Forest: 1660.04112870205
CPU times: user 45min 39s, sys: 7.5 s, total: 45min 46s
Wall time: 45min 49s


After hyperparameter tuning, the Random Forest Regressor improved from the initial RMSE of 1893.88 to a cross-validated RMSE of 1660.04.

Increasing the depth of the individual trees (max_depth = 20) allows the model to learn more complex patterns, while 50 trees (n_estimators) provide a good balance between performance and computational cost.

In [17]:
%%time
# Define hyperparameters to tune
lgb_params = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 15, 20],
    'learning_rate': [0.01, 0.05, 0.1]
}

# Initialize LightGBM Regressor
lgb_model = LGBMRegressor(random_state=42)

# Initialize GridSearchCV for LightGBM
lgb_grid = GridSearchCV(estimator=lgb_model, param_grid=lgb_params, cv=3, scoring='neg_mean_squared_error')

# Fit the grid search model
lgb_grid.fit(X_train, y_train)

# Best parameters and score
print(f"Best parameters for LightGBM: {lgb_grid.best_params_}")
print(f"Best RMSE for LightGBM: {np.sqrt(-lgb_grid.best_score_)}")

Best parameters for LightGBM: {'learning_rate': 0.1, 'max_depth': 15, 'n_estimators': 300}
Best RMSE for LightGBM: 1623.3174694969798
CPU times: user 7min 28s, sys: 11.6 s, total: 7min 39s
Wall time: 7min 40s


LightGBM continues to be the best-performing model, improving from an initial RMSE of 1699.73 to a cross-validated RMSE of 1623.32 after hyperparameter tuning.

With the lowest RMSE among all tuned models, LightGBM proves to be highly effective for this prediction task.

## Testing the Best Models on Test Data 

In [18]:
# Best model from GridSearchCV for Decision Tree
best_dt_model = dt_grid.best_estimator_

# Measure training time
start_time = time.time()
best_dt_model.fit(X_train, y_train)
training_time_dt = time.time() - start_time
print(f"Decision Tree Training Time: {training_time_dt:.4f} seconds")

# Measure prediction time
start_time = time.time()
y_pred_dt = best_dt_model.predict(X_test)
prediction_time_dt = time.time() - start_time
print(f"Decision Tree Prediction Time: {prediction_time_dt:.4f} seconds")

# Evaluate performance on test data
rmse_dt = np.sqrt(mean_squared_error(y_test, y_pred_dt))
print(f"Decision Tree RMSE on Test Data: {rmse_dt:.4f}")

Decision Tree Training Time: 3.1940 seconds
Decision Tree Prediction Time: 0.0459 seconds
Decision Tree RMSE on Test Data: 1815.1099


In [19]:
# Best model from GridSearchCV for Random Forest
best_rf_model = rf_grid.best_estimator_

# Measure training time
start_time = time.time()
best_rf_model.fit(X_train, y_train)
training_time_rf = time.time() - start_time
print(f"Random Forest Training Time: {training_time_rf:.4f} seconds")

# Measure prediction time
start_time = time.time()
y_pred_rf = best_rf_model.predict(X_test)
prediction_time_rf = time.time() - start_time
print(f"Random Forest Prediction Time: {prediction_time_rf:.4f} seconds")

# Evaluate performance on test data
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
print(f"Random Forest RMSE on Test Data: {rmse_rf:.4f}")

Random Forest Training Time: 107.7307 seconds
Random Forest Prediction Time: 0.5238 seconds
Random Forest RMSE on Test Data: 1620.5899


In [20]:
# Best model from GridSearchCV for LightGBM
best_lgb_model = lgb_grid.best_estimator_

# Measure training time
start_time = time.time()
best_lgb_model.fit(X_train, y_train)
training_time_lgb = time.time() - start_time
print(f"LightGBM Training Time: {training_time_lgb:.4f} seconds")

# Measure prediction time
start_time = time.time()
y_pred_lgb = best_lgb_model.predict(X_test)
prediction_time_lgb = time.time() - start_time
print(f"LightGBM Prediction Time: {prediction_time_lgb:.4f} seconds")

# Evaluate performance on test data
rmse_lgb = np.sqrt(mean_squared_error(y_test, y_pred_lgb))
print(f"LightGBM RMSE on Test Data: {rmse_lgb:.4f}")

LightGBM Training Time: 7.9731 seconds
LightGBM Prediction Time: 1.1271 seconds
LightGBM RMSE on Test Data: 1619.2757
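
The three evaluation cells above repeat the same timing pattern. As a sketch, it could be wrapped in one helper; `time_model` is a hypothetical name introduced here, and the example runs on synthetic data rather than the car dataset. It uses `time.perf_counter`, which is intended for measuring short durations.

```python
import time

import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def time_model(model, X_train, y_train, X_test, y_test):
    """Fit and evaluate a model, returning timings (seconds) and test RMSE."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_test)
    predict_seconds = time.perf_counter() - start

    rmse = float(np.sqrt(mean_squared_error(y_test, predictions)))
    return {
        "train_seconds": train_seconds,
        "predict_seconds": predict_seconds,
        "rmse": rmse,
    }


# Usage on synthetic data (hypothetical, for illustration only)
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

report = time_model(DecisionTreeRegressor(max_depth=5, random_state=42), X_tr, y_tr, X_te, y_te)
print(report)
```

Single wall-clock measurements are noisy, so repeating each measurement a few times and averaging would give more stable numbers.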


## Model analysis

The key criteria Rusty Bargain is interested in are:

- <b>Quality of the Prediction:</b> Measured by RMSE (Root Mean Squared Error).
- <b>Speed of Prediction:</b> How fast the model makes predictions.
- <b>Time Required for Training:</b> How long it takes to train the model.

Quality of the Prediction (RMSE):

LightGBM has the best RMSE (1619.2757), followed very closely by Random Forest (1620.5899).
The Decision Tree has a significantly higher RMSE (1815.1099), making it the least accurate of the three.

Speed of the Prediction:

The Decision Tree is by far the fastest in prediction time (0.0459 seconds).
Random Forest has a moderate prediction time (0.5238 seconds), while LightGBM is the slowest (1.1271 seconds).

Time Required for Training:

The Decision Tree is the quickest to train (3.1940 seconds), followed by LightGBM (7.9731 seconds).
Random Forest takes significantly longer to train (107.7307 seconds).
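
The figures discussed above can be collected into one table for a side-by-side view. The sketch below simply re-enters the numbers printed by the evaluation cells; the column names are choices made here for illustration.

```python
import pandas as pd

# Values copied from the test-set evaluation outputs above
summary = pd.DataFrame(
    {
        "rmse": [1815.1099, 1620.5899, 1619.2757],
        "train_seconds": [3.1940, 107.7307, 7.9731],
        "predict_seconds": [0.0459, 0.5238, 1.1271],
    },
    index=["Decision Tree", "Random Forest", "LightGBM"],
)

# Rank the models by prediction quality
print(summary.sort_values("rmse"))

# Best model per criterion (lower is better for all three columns)
best_per_criterion = summary.idxmin()
print(best_per_criterion)
```

No single model wins on every criterion, so the final choice depends on how the three criteria are weighted.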


### Final Recommendation 

In this project, we built and evaluated several machine learning models to predict used car prices. The models varied in accuracy, training time, and inference speed. Among the models tested, LightGBM offered the best overall balance across the target criteria.

LightGBM offers the best prediction quality with the lowest RMSE (1619.2757) and a reasonable training time (7.9731 seconds), roughly 13 times faster than Random Forest, whose RMSE (1620.5899) is nearly identical. It is the slowest at prediction (1.1271 seconds for the 56,299 test rows), but that is still well under a millisecond per car. Its accuracy and short training time make it the best choice for Rusty Bargain.

