Hello Dinesh!

I’m happy to review your project today.
I will mark your mistakes and give you some hints on how to fix them. We are getting ready for a real job, where your team leader or senior colleague will do exactly the same. Don't worry, and enjoy studying!

Below you will find my comments - **please do not move, modify or delete them**.

You can find my comments in green, yellow or red boxes like this:

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Success. Everything is done successfully.
</div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Remarks. Some recommendations.
</div>

<div class="alert alert-block alert-danger">

<b>Reviewer's comment</b> <a class="tocSkip"></a>

Needs fixing. The block requires some corrections. Work can't be accepted with the red comments.
</div>

You can answer me by using this:

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>

Text here.
</div>

# **Used Car Price Prediction with Machine Learning**

## Introduction

Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training

## Import all required Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV
import lightgbm as lgb
import xgboost as xgb
import time
import joblib

## Data preparation

In [2]:
# Load dataset
df = pd.read_csv('/datasets/car_data.csv')

# Display first few rows
df.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


In [3]:
# Get dataset info
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

### Dealing with Duplicate rows

In [4]:
# Count duplicate rows in the full dataset (before removing any columns)
duplicate_count_full = df.duplicated().sum()
print(f"Number of duplicate rows (before dropping columns): {duplicate_count_full}")

# Remove true duplicates
df = df.drop_duplicates()

# Verify removal
print(f"Dataset shape after removing true duplicates: {df.shape}")


Number of duplicate rows (before dropping columns): 262
Dataset shape after removing true duplicates: (354107, 16)


In [5]:
# Drop columns that are not useful for price prediction
df = df.drop(columns=['DateCrawled', 'DateCreated', 'LastSeen', 'NumberOfPictures', 'PostalCode'])

# Display the new structure of the dataset
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 354107 entries, 0 to 354368
Data columns (total 11 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   Price              354107 non-null  int64 
 1   VehicleType        316623 non-null  object
 2   RegistrationYear   354107 non-null  int64 
 3   Gearbox            334277 non-null  object
 4   Power              354107 non-null  int64 
 5   Model              334406 non-null  object
 6   Mileage            354107 non-null  int64 
 7   RegistrationMonth  354107 non-null  int64 
 8   FuelType           321218 non-null  object
 9   Brand              354107 non-null  object
 10  NotRepaired        282962 non-null  object
dtypes: int64(5), object(6)
memory usage: 32.4+ MB


### Dealing with Missing Values

In [6]:
# Count missing values in each column
df.isnull().sum()


Price                    0
VehicleType          37484
RegistrationYear         0
Gearbox              19830
Power                    0
Model                19701
Mileage                  0
RegistrationMonth        0
FuelType             32889
Brand                    0
NotRepaired          71145
dtype: int64

In [7]:
# Fill missing values in categorical features with "unknown"
categorical_features = ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'NotRepaired']
df[categorical_features] = df[categorical_features].fillna('unknown')

# Check if there are any missing values left
print(df.isnull().sum())


Price                0
VehicleType          0
RegistrationYear     0
Gearbox              0
Power                0
Model                0
Mileage              0
RegistrationMonth    0
FuelType             0
Brand                0
NotRepaired          0
dtype: int64


### Dealing with Anomalies

In [8]:
# Check unique values in RegistrationYear
print("Unique RegistrationYear values:", sorted(df['RegistrationYear'].unique()))

# Check Power column distribution
print("Power column statistics:")
print(df['Power'].describe())

# Check Price column distribution
print("Price column statistics:")
print(df['Price'].describe())

# Check if there are vehicles with 0 Price
print("Number of cars with price 0:", (df['Price'] == 0).sum())

# Check if there are vehicles with Power 0
print("Number of cars with Power 0:", (df['Power'] == 0).sum())


Unique RegistrationYear values: [1000, 1001, 1039, 1111, 1200, 1234, 1253, 1255, 1300, 1400, 1500, 1600, 1602, 1688, 1800, 1910, 1915, 1919, 1920, 1923, 1925, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2066, 2200, 2222, 2290, 2500, 2800, 2900, 3000, 3200, 3500, 3700, 3800, 4000, 4100, 4500, 4800, 5000, 5300, 5555, 5600, 5900, 5911, 6000, 6500, 7000, 7100, 7500, 7800, 8000, 8200, 8455, 8500, 8888, 9000, 9229, 9450, 9996, 9999]
Power column statistics:
count    354107.000000
mean        1

**RegistrationYear Anomalies:**

Some values are impossible (e.g., 1000, 3000, 9999).
Realistic registration years should fall within a plausible range.
We keep cars registered between 1910 and 2019. Note that the `DateCrawled` values shown above are from 2016, so registrations dated after 2016 are also suspect and could be removed with a tighter upper bound.

In [9]:
# Keep only cars registered between 1910 and 2019
df = df[(df['RegistrationYear'] >= 1910) & (df['RegistrationYear'] <= 2019)]

# Check unique values again
print("Filtered RegistrationYear values:", sorted(df['RegistrationYear'].unique()))


Filtered RegistrationYear values: [1910, 1915, 1919, 1920, 1923, 1925, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]
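
The upper bound could also be derived from the data rather than assumed. The sketch below uses a small hypothetical frame whose `DateCrawled` strings follow the day/month/year format visible in the `df.head()` output. In this notebook `DateCrawled` has already been dropped by this point, so such a check would have to run before that step.

```python
import pandas as pd

# Hypothetical sample in the same format as the DateCrawled column above.
sample = pd.DataFrame({
    "DateCrawled": ["24/03/2016 11:52", "14/03/2016 12:52", "31/03/2016 17:25"],
    "RegistrationYear": [1993, 2018, 2008],
})

# Parse day-first timestamps and take the latest crawl year as the cap.
crawled = pd.to_datetime(sample["DateCrawled"], format="%d/%m/%Y %H:%M")
max_year = int(crawled.dt.year.max())

# A car cannot be registered after the ad was crawled.
filtered = sample[sample["RegistrationYear"].between(1910, max_year)]

print("Latest crawl year:", max_year)
print(filtered)
```

The lower bound of 1910 is kept from the text above; only the upper bound is data-driven here.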


**Power Anomalies:**

Power = 0 for 40,225 cars, which is unrealistic for a working vehicle.
Max Power = 20,000 hp, which is far beyond any production car and is likely a data-entry error.
We filter Power to between 10 hp and 500 hp, since most cars fall within this range.


In [10]:
# Keep cars with Power between 10 and 500 hp
df = df[(df['Power'] >= 10) & (df['Power'] <= 500)]

# Check new Power stats
print("Filtered Power column statistics:")
print(df['Power'].describe())


Filtered Power column statistics:
count    313178.000000
mean        120.105566
std          53.421311
min          10.000000
25%          75.000000
50%         110.000000
75%         150.000000
max         500.000000
Name: Power, dtype: float64


**Price Anomalies:**

Price = 0 for 10,772 cars, which usually means the price is missing or irrelevant.
Very low prices are unlikely to reflect real market value, so we use 500 EUR as a practical minimum.
We filter Price to between 500 and 20,000 EUR.

In [11]:
# Keep cars with Price between 500 and 20,000 EUR
df = df[(df['Price'] >= 500) & (df['Price'] <= 20000)]

# Check new Price stats
print("Filtered Price column statistics:")
print(df['Price'].describe())


Filtered Price column statistics:
count    288808.000000
mean       5084.720489
std        4581.664740
min         500.000000
25%        1500.000000
50%        3499.000000
75%        7250.000000
max       20000.000000
Name: Price, dtype: float64
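
To keep track of how much data each filter removes, the three range filters can be written with `Series.between` and applied in sequence. This is a minimal sketch on hypothetical rows. The bounds are the ones chosen above, while `apply_range_filters` and the toy frame are illustrative names only.

```python
import pandas as pd

# Bounds chosen in this section (inclusive on both ends).
BOUNDS = {
    "RegistrationYear": (1910, 2019),
    "Power": (10, 500),
    "Price": (500, 20000),
}

def apply_range_filters(frame, bounds):
    """Apply each inclusive range filter in turn and report rows removed."""
    removed = {}
    for column, (low, high) in bounds.items():
        before = len(frame)
        frame = frame[frame[column].between(low, high)]
        removed[column] = before - len(frame)
    return frame, removed

# Hypothetical rows: one bad year, one zero power, one zero price, one valid.
toy = pd.DataFrame({
    "RegistrationYear": [9999, 2005, 2010, 2012],
    "Power": [100, 0, 120, 90],
    "Price": [3000, 2500, 0, 4200],
})

clean, removed = apply_range_filters(toy, BOUNDS)
print(removed)
print(clean)
```

Because the filters run in sequence, each count reflects only the rows that survived the earlier filters.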


<div class="alert alert-block alert-success">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

Everything is correct. Good job!
  
</div>

**Our dataset is cleaned and ready for feature encoding and model preparation.**

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

You need to check for duplicates before you dropped several columns. Why? Can several identical cars be sold at different times in different areas? Of course they can. Cars are not unique. Once you remove some of the columns, you can't know if they are two different but identical cars or if they are true duplicate ads. So, please, fix it and remove the true duplicates.
  
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Fixed
  
</div>
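
The reviewer's point about the order of de-duplication can be illustrated on a toy frame. This is a minimal sketch with made-up rows (the `toy` DataFrame is hypothetical, not taken from `car_data.csv`): two identical cars listed on different dates are distinct ads, but they look like duplicates once the date column is dropped.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are the same kind of car listed on
# different dates; rows 2 and 3 are one ad recorded twice (a true duplicate).
toy = pd.DataFrame({
    "DateCreated": ["14/03/2016", "20/03/2016", "24/03/2016", "24/03/2016"],
    "Brand": ["volkswagen", "volkswagen", "audi", "audi"],
    "Model": ["golf", "golf", "a4", "a4"],
    "Price": [1500, 1500, 9800, 9800],
})

# Checking on the full frame finds only the true duplicate.
dupes_full = int(toy.duplicated().sum())

# Checking after dropping the date column also flags the distinct golf ad.
dupes_after_drop = int(toy.drop(columns=["DateCreated"]).duplicated().sum())

print("Duplicates on full rows:", dupes_full)
print("Duplicates after dropping DateCreated:", dupes_after_drop)
```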

### Feature Encoding

In [12]:
# Find categorical columns
categorical_columns = df.select_dtypes(include=['object']).columns
print("Categorical Columns:", categorical_columns)


Categorical Columns: Index(['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired'], dtype='object')


In [13]:
# Perform One-Hot Encoding
df_ohe = pd.get_dummies(df, columns=categorical_columns, drop_first=True)

# Display the new dataset shape
print("Dataset shape after One-Hot Encoding:", df_ohe.shape)


Dataset shape after One-Hot Encoding: (288808, 312)


In [14]:
# Create a separate dataset where categorical features remain unchanged (for CatBoost and LightGBM)
df_catboost = df.copy()


We first removed exact duplicate rows from the full dataset, then dropped columns that are not useful for price prediction (timestamps, number of pictures, and postal codes). Next, we filled missing values in the categorical columns with "unknown". To clean the numerical data, we filtered out unrealistic values, keeping only cars registered between 1910 and 2019, with Power between 10 and 500 hp, and Price between 500 and 20,000 EUR. Finally, we prepared two datasets: one with One-Hot Encoding (OHE) for Linear Regression, Decision Tree, Random Forest, and XGBoost, and another that keeps the categorical features as-is for LightGBM. The data is now clean and ready for splitting and model training.
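
The two representations can be compared on a toy frame. This is a minimal sketch with hypothetical rows: `pd.get_dummies(..., drop_first=True)` expands each categorical column into indicator columns, while the `category` dtype keeps one column per feature for libraries that handle categories natively.

```python
import pandas as pd

# Hypothetical rows using column names from the dataset.
toy = pd.DataFrame({
    "Gearbox": ["manual", "auto", "unknown", "manual"],
    "FuelType": ["petrol", "gasoline", "petrol", "petrol"],
    "Power": [75, 163, 69, 190],
})
cat_cols = ["Gearbox", "FuelType"]

# One-hot encoding: k categories become k - 1 indicator columns per feature.
toy_ohe = pd.get_dummies(toy, columns=cat_cols, drop_first=True)

# Native categorical representation: same number of columns as the input.
toy_cat = toy.astype({col: "category" for col in cat_cols})

print(list(toy_ohe.columns))
print(toy_cat.dtypes)
```

With many distinct values per feature (such as `Model`), the one-hot frame grows wide, which is why the OHE dataset above has 312 columns.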

<div class="alert alert-block alert-success">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

Well done!
  
</div>

## Model training

### Split data for models requiring numerical input (OHE dataset)

In [15]:
# Define features and target
X = df_ohe.drop(columns=['Price'])  # Features
y = df_ohe['Price']  # Target variable

# Split into training and validation sets (75% train, 25% validation)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=42)

# Display dataset shapes
print("Training set shape:", X_train.shape, y_train.shape)
print("Validation set shape:", X_valid.shape, y_valid.shape)


Training set shape: (216606, 311) (216606,)
Validation set shape: (72202, 311) (72202,)


### Split data for models that support categorical features (LightGBM & CatBoost)

In [16]:
# Define features and target for categorical dataset
X_cat = df_catboost.drop(columns=['Price'])
y_cat = df_catboost['Price']

# Split into training and validation sets
X_train_cat, X_valid_cat, y_train_cat, y_valid_cat = train_test_split(X_cat, y_cat, test_size=0.25, random_state=42)

# Display dataset shapes
print("Training set shape (CatBoost/LightGBM):", X_train_cat.shape, y_train_cat.shape)
print("Validation set shape (CatBoost/LightGBM):", X_valid_cat.shape, y_valid_cat.shape)


Training set shape (CatBoost/LightGBM): (216606, 10) (216606,)
Validation set shape (CatBoost/LightGBM): (72202, 10) (72202,)


In [17]:
# Free memory before model training
del df_ohe, df_catboost
del categorical_columns
print("Unused variables deleted to free memory.")

Unused variables deleted to free memory.


<div class="alert alert-block alert-success">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

Correct
  
</div>

### Scale Quantitative Features for Linear Regression

In [18]:
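# Identify numerical (quantitative) features explicitly, so that one-hot
# encoded indicator columns are never selected by dtype and scaled by mistake
numerical_features = ['RegistrationYear', 'Power', 'Mileage', 'RegistrationMonth']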

# Initialize scaler
scaler = StandardScaler()

# Scale only numerical features
X_train_scaled = X_train.copy()
X_valid_scaled = X_valid.copy()

X_train_scaled[numerical_features] = scaler.fit_transform(X_train[numerical_features])
X_valid_scaled[numerical_features] = scaler.transform(X_valid[numerical_features])

print("Numerical features scaled successfully!")


Numerical features scaled successfully!


### Train a Linear Regression Model (Baseline Check)

In [19]:
# Initialize model
lr_model = LinearRegression()

# Measure training time
start_train = time.time()
lr_model.fit(X_train_scaled, y_train)
end_train = time.time()
train_time = end_train - start_train

# Measure prediction time
start_pred = time.time()
lr_preds = lr_model.predict(X_valid_scaled)
end_pred = time.time()
pred_time = end_pred - start_pred

# Calculate RMSE
lr_rmse = np.sqrt(mean_squared_error(y_valid, lr_preds))

# Print results
print(f"Linear Regression RMSE: {lr_rmse:.2f}")
print(f"Training Time: {train_time:.2f} sec")
print(f"Prediction Time: {pred_time:.4f} sec")

Linear Regression RMSE: 2586.64
Training Time: 8.99 sec
Prediction Time: 0.1034 sec


<div class="alert alert-block alert-danger">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

If you're going to use any linear model, all quantitative features should be scaled. Not all features but only quantitative features. One hot encoded features should not be scaled because they have a perfect scale by default and additional scaling only ruins it. So, please, fix it.
  
</div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Please, read my comment carefully: "One hot encoded features should not be scaled because they have a perfect scale by default and additional scaling only ruins it.". But you scaled all the features including all the one hot encoded ones. Why? Because one hot encoded features have integer type.
  
</div>
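
One way to follow the reviewer's advice is to name the quantitative columns explicitly instead of selecting them by dtype. The sketch below is a hypothetical, pandas-only illustration on toy frames with a subset of the columns. It standardizes with the training mean and population standard deviation (`ddof=0`), which is the same transformation `StandardScaler` applies, and leaves the indicator column untouched.

```python
import pandas as pd

# Quantitative columns named explicitly, so indicator columns are never
# picked up by a dtype-based selection.
NUMERIC_COLS = ["RegistrationYear", "Power", "Mileage"]

# Hypothetical train/validation frames with one one-hot column.
train = pd.DataFrame({
    "RegistrationYear": [2000, 2010, 2005, 2015],
    "Power": [75, 150, 110, 190],
    "Mileage": [150000, 90000, 125000, 50000],
    "Gearbox_manual": [1, 0, 1, 0],
})
valid = pd.DataFrame({
    "RegistrationYear": [2008],
    "Power": [120],
    "Mileage": [100000],
    "Gearbox_manual": [1],
})

# Fit the statistics on the training split only.
mean = train[NUMERIC_COLS].mean()
std = train[NUMERIC_COLS].std(ddof=0)

def scale(frame):
    """Standardize the numeric columns; leave indicator columns as they are."""
    out = frame.astype({col: "float64" for col in NUMERIC_COLS})
    out[NUMERIC_COLS] = (out[NUMERIC_COLS] - mean) / std
    return out

train_scaled = scale(train)
valid_scaled = scale(valid)
print(train_scaled.round(3))
```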

### Train a Decision Tree Regressor

In [20]:
# Initialize model
dt_model = DecisionTreeRegressor(random_state=42, max_depth=10)

# Measure training time
start_train = time.time()
dt_model.fit(X_train, y_train)
end_train = time.time()
train_time = end_train - start_train

# Measure prediction time
start_pred = time.time()
dt_preds = dt_model.predict(X_valid)
end_pred = time.time()
pred_time = end_pred - start_pred

# Calculate RMSE
dt_rmse = np.sqrt(mean_squared_error(y_valid, dt_preds))

# Print results
print(f"Decision Tree RMSE: {dt_rmse:.2f}")
print(f"Training Time: {train_time:.2f} sec")
print(f"Prediction Time: {pred_time:.4f} sec")

Decision Tree RMSE: 1992.44
Training Time: 2.74 sec
Prediction Time: 0.0540 sec


### Train a Random Forest Regressor

In [21]:
# Initialize model
rf_model = RandomForestRegressor(n_estimators=100, max_depth=15, min_samples_split=5, random_state=42, n_jobs=-1)

# Measure training time
start_train = time.time()
rf_model.fit(X_train, y_train)
end_train = time.time()
train_time = end_train - start_train

# Measure prediction time
start_pred = time.time()
rf_preds = rf_model.predict(X_valid)
end_pred = time.time()
pred_time = end_pred - start_pred

# Calculate RMSE
rf_rmse = np.sqrt(mean_squared_error(y_valid, rf_preds))

# Print results
print(f"Random Forest RMSE: {rf_rmse:.2f}")
print(f"Training Time: {train_time:.2f} sec")
print(f"Prediction Time: {pred_time:.4f} sec")

Random Forest RMSE: 1677.16
Training Time: 209.70 sec
Prediction Time: 0.9948 sec


## Train Gradient Boosting Models

In [22]:
# Identify available categorical columns
available_categorical_cols = X_train_cat.select_dtypes(include=['object']).columns.tolist()
print("Available categorical columns:", available_categorical_cols)

# Convert categorical columns to 'category' dtype using .loc
for col in available_categorical_cols:
    X_train_cat.loc[:, col] = X_train_cat[col].astype('category')
    X_valid_cat.loc[:, col] = X_valid_cat[col].astype('category')

# Initialize model
lgb_model = lgb.LGBMRegressor(n_estimators=100, learning_rate=0.1, max_depth=7, random_state=42)

# Measure training time
start_train = time.time()
lgb_model.fit(X_train_cat, y_train_cat)
end_train = time.time()
train_time = end_train - start_train

# Measure prediction time
start_pred = time.time()
lgb_preds = lgb_model.predict(X_valid_cat)
end_pred = time.time()
pred_time = end_pred - start_pred

# Calculate RMSE
lgb_rmse = np.sqrt(mean_squared_error(y_valid_cat, lgb_preds))

# Print results
print(f"LightGBM RMSE: {lgb_rmse:.2f}")
print(f"Training Time: {train_time:.2f} sec")
print(f"Prediction Time: {pred_time:.4f} sec")

Available categorical columns: ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired']


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(ilocs[0], value, pi)


LightGBM RMSE: 1630.34
Training Time: 229.25 sec
Prediction Time: 0.5165 sec


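The warning above is pandas' `SettingWithCopyWarning`: the dtype conversion writes into a DataFrame that pandas treats as a slice of another one. All six categorical columns were found and the model trained, so the conversion took effect; converting on an explicit `.copy()` would avoid the warning.

**LightGBM RMSE: 1630.34**
This is better than the Random Forest RMSE (1677.16), with a similar training time (229.25 sec vs. 209.70 sec).
We should compare it with XGBoost before selecting the best model.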
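
A conversion that returns a new frame instead of writing into a slice does not trigger `SettingWithCopyWarning`. This is a minimal sketch on hypothetical data. `to_category` is an illustrative helper, not part of the notebook.

```python
import pandas as pd

def to_category(frame, columns):
    """Return a new frame with the given columns cast to category dtype."""
    return frame.astype({col: "category" for col in columns})

# Hypothetical slice of a larger frame, similar to a train/validation split.
full = pd.DataFrame({
    "Brand": ["audi", "skoda", "jeep", "audi"],
    "Gearbox": ["manual", "auto", "auto", "manual"],
    "Power": [190, 69, 163, 150],
})
subset = full[full["Power"] > 100]

cat_cols = ["Brand", "Gearbox"]
converted = to_category(subset, cat_cols)

print(converted.dtypes)
```

`DataFrame.astype` with a column-to-dtype mapping leaves `subset` unchanged, so the same helper can be applied to the training and validation frames independently.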

### Hyperparameter Tuning for XGBoost

In [23]:
# Define the hyperparameter search space
param_grid = {
    'n_estimators': [50, 100],  # number of boosting rounds
    'learning_rate': [0.05, 0.1],  # step size shrinkage
    'max_depth': [3, 5],  # maximum tree depth
    'subsample': [0.8, 1.0],  # fraction of rows sampled per tree
    'colsample_bytree': [0.8, 1.0]  # fraction of columns sampled per tree
}

# Initialize XGBoost model
xgb_model = xgb.XGBRegressor(random_state=42, n_jobs=-1)  # Use all CPU cores

# Initialize RandomizedSearchCV with optimized parameters
random_search = RandomizedSearchCV(
    estimator=xgb_model,
    param_distributions=param_grid,
    n_iter=3,  # Fewer iterations for faster tuning
    scoring='neg_root_mean_squared_error',
    cv=2,  # Reduce cross-validation folds to speed up training
    random_state=42,
    n_jobs=-1  # Use all CPU cores for parallel processing
)

# Run hyperparameter tuning
random_search.fit(X_train, y_train)

# Print best parameters
print("Best hyperparameters:", random_search.best_params_)

# Get the best model
best_xgb_model = random_search.best_estimator_


Best hyperparameters: {'subsample': 1.0, 'n_estimators': 100, 'max_depth': 5, 'learning_rate': 0.1, 'colsample_bytree': 0.8}


### Training XGBoost

In [24]:
# Re-create XGBoost with the best hyperparameters found by the search,
# so that a single fit can be timed separately from the tuning run
best_xgb_model = xgb.XGBRegressor(
    **random_search.best_params_,
    random_state=42,
    n_jobs=-1  # Use all CPU cores
)

# Measure training time
start_train = time.time()
best_xgb_model.fit(X_train, y_train)  # Train the optimized XGBoost model
end_train = time.time()
train_time = end_train - start_train

# Measure prediction time
start_pred = time.time()
xgb_preds = best_xgb_model.predict(X_valid)
end_pred = time.time()
pred_time = end_pred - start_pred

# Calculate RMSE
xgb_rmse = np.sqrt(mean_squared_error(y_valid, xgb_preds))

# Print results
print(f"XGBoost RMSE after tuning: {xgb_rmse:.2f}")
print(f"XGBoost Training Time: {train_time:.2f} seconds")
print(f"XGBoost Prediction Time: {pred_time:.4f} seconds")

XGBoost RMSE after tuning: 1754.45
XGBoost Training Time: 267.43 seconds
XGBoost Prediction Time: 0.5766 seconds


### Analysis & Best Model Selection

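**Linear Regression** performed the worst (RMSE 2586.64), as expected; it trained in 8.99 sec and predicted in 0.1034 sec.

**Decision Tree** improved significantly (RMSE 1992.44) and was the fastest model to train (2.74 sec) and predict (0.0540 sec), but its error is still high.

**Random Forest** (RMSE 1677.16) performed much better, at the cost of a 209.70 sec training time and the slowest prediction (0.9948 sec).

**LightGBM** (RMSE 1630.34) had the lowest RMSE; it trained in 229.25 sec and predicted in 0.5165 sec.

**XGBoost** (RMSE 1754.45 after tuning) was worse than both Random Forest and LightGBM and had the longest training time (267.43 sec).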
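
The comparison can be kept in one place by collecting the metrics into a DataFrame and sorting by RMSE. In the sketch below the numbers are copied from the printed cell outputs above. The `results` frame and its column names are illustrative.

```python
import pandas as pd

# Values copied from the printed cell outputs above.
results = pd.DataFrame(
    [
        ("Linear Regression", 2586.64, 8.99, 0.1034),
        ("Decision Tree", 1992.44, 2.74, 0.0540),
        ("Random Forest", 1677.16, 209.70, 0.9948),
        ("LightGBM", 1630.34, 229.25, 0.5165),
        ("XGBoost (tuned)", 1754.45, 267.43, 0.5766),
    ],
    columns=["model", "rmse", "train_s", "predict_s"],
)

# Rank by prediction quality; the timing columns stay alongside for context.
ranked = results.sort_values("rmse").reset_index(drop=True)
print(ranked.to_string(index=False))
```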

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

1. You need to tune hyperparameters at least for one model. Please, use GridSearchCV or RandomizedSearchCV to do it.
2. Right now you measured not only training time but: model initialization time + training time + prediction time + rmse calculation time. But for each model you need to measure two separate times. One is training time (method fit) and one is prediction time (method predict). To do it, you can use library `time`. Moreover, you need to use both these times in the model analysis part below.

Keep in mind that RandomizedSearchCV or GridSearchCV training time and model training time are not the same things. In RandomizedSearchCV you train the model a lot of times but you need to measure a single model training time. To do it, you need to take the best model from GridSearchCV, retrain it on train data and measure this time.
  
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Correct. Good job!
  
</div>
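
The separate timing of `fit` and `predict` described in the review can be wrapped in a small helper. This is a minimal sketch, not the notebook's code: `timed_fit_predict` and `MeanModel` are hypothetical names, and it uses `time.perf_counter` (a monotonic, high-resolution clock) where the notebook uses `time.time`.

```python
import time

def timed_fit_predict(model, X_train, y_train, X_valid):
    """Time fit and predict separately; return (predictions, fit_s, predict_s)."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_valid)
    predict_seconds = time.perf_counter() - start
    return predictions, fit_seconds, predict_seconds

class MeanModel:
    """Hypothetical stand-in with the scikit-learn fit/predict interface."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]

preds, fit_s, pred_s = timed_fit_predict(
    MeanModel(), [[1], [2], [3]], [10, 20, 30], [[4], [5]]
)
print(preds, f"fit={fit_s:.6f}s", f"predict={pred_s:.6f}s")
```

Any estimator with `fit` and `predict` methods could be passed in place of `MeanModel`, so model construction and the RMSE calculation stay outside the timed sections.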

## Model analysis

### Save the LightGBM Model

In [25]:
# Save the trained LightGBM model
joblib.dump(lgb_model, "lightgbm_used_car_price_model.pkl")

print("LightGBM Model saved successfully!")

LightGBM Model saved successfully!


### Load & Test the Saved Model

In [26]:
# Load the saved LightGBM model
loaded_model = joblib.load("lightgbm_used_car_price_model.pkl")

# Test prediction on the validation set
test_preds = loaded_model.predict(X_valid_cat)

# Calculate RMSE again to verify accuracy
test_rmse = np.sqrt(mean_squared_error(y_valid_cat, test_preds))
print(f"Model Test RMSE after loading: {test_rmse:.2f}")



Model Test RMSE after loading: 1630.34


### Deploy the Model for Real-time Predictions

In [29]:
# Get the categorical column names
cat_features = ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired']

# Extract a sample row for prediction
sample_car = X_valid_cat.iloc[[0]].copy()  # Use double brackets to keep it as a DataFrame

# Convert categorical columns to category dtype
for col in cat_features:
    if col in sample_car.columns:  # Ensure the column exists
        sample_car[col] = sample_car[col].astype('category')

# Ensure columns match training format
sample_car = sample_car[X_valid_cat.columns]  

# Predict the price
predicted_price = loaded_model.predict(sample_car)

print(f"Predicted Car Price: {predicted_price[0]:.2f} EUR")


Predicted Car Price: 1742.45 EUR
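
In an app, the input would arrive as a single record rather than a row of the validation set. The sketch below shows one possible way to shape such a record to match the training frame. `make_model_input` and the toy frames are hypothetical, and the model call itself is left out.

```python
import pandas as pd

def make_model_input(car, training_frame):
    """Build a one-row frame with the training columns, order and dtypes."""
    row = pd.DataFrame([car])[list(training_frame.columns)]
    # Reuse each training dtype (including its category list) so category
    # codes mean the same thing at prediction time as during training.
    return row.astype(training_frame.dtypes.to_dict())

# Hypothetical training frame with category columns.
train = pd.DataFrame({
    "Brand": pd.Categorical(["audi", "skoda", "volkswagen"]),
    "Gearbox": pd.Categorical(["manual", "auto", "manual"]),
    "Power": [190, 69, 75],
})

# Hypothetical incoming record; key order does not matter.
car = {"Power": 110, "Gearbox": "auto", "Brand": "skoda"}
row = make_model_input(car, train)
print(row.dtypes)
```

A value that was not seen during training becomes missing (`NaN`) after the cast, which is worth checking for before calling `predict`.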


## **Summary**  

### **Project Overview**  
Rusty Bargain is developing an app to estimate the **market value of used cars** based on historical data. Our goal was to **build a machine learning model** with:  
- **High Prediction Quality** (low RMSE)  
- **Fast Training & Prediction Speed**  

---

### **Data Preprocessing**  
- Removed irrelevant columns (`DateCrawled`, `DateCreated`, `LastSeen`, `NumberOfPictures`, `PostalCode`).  
- Filled missing values in `VehicleType`, `Gearbox`, `Model`, `FuelType`, and `NotRepaired` with "unknown".  
- Filtered unrealistic values:  
   - **Registration Year**: kept **1910-2019**.  
   - **Power**: kept **10-500 hp**.  
   - **Price**: kept **500-20,000 EUR**.  
- Used **One-Hot Encoding** for Linear Regression, Decision Tree, Random Forest, and XGBoost.  
- Kept categorical features as-is for **LightGBM**.  
- **Split data:** **75% training, 25% validation**.  

---

### **Model Training & Performance**  

Values are taken from the cell outputs above.

| **Model**             | **RMSE**    | **Train Time (s)** | **Predict Time (s)** |
|-----------------------|-------------|--------------------|----------------------|
| **Linear Regression** | 2586.64     | 8.99               | 0.1034               |
| **Decision Tree**     | 1992.44     | **2.74**           | **0.0540**           |
| **Random Forest**     | 1677.16     | 209.70             | 0.9948               |
| **LightGBM**          | **1630.34** | 229.25             | 0.5165               |
| **XGBoost (Tuned)**   | 1754.45     | 267.43             | 0.5766               |

---

### **Key Findings**  
- **LightGBM is the best model on quality** (**RMSE = 1630.34**), with a prediction time of about 0.5 sec on the validation set.  
- **Random Forest (1677.16 RMSE) performed well** but had a long training time and the slowest prediction.  
- **XGBoost (1754.45 RMSE) underperformed and was the slowest to train**; its search covered only 3 random configurations with 2-fold cross-validation, so further tuning may help.  
- **Decision Tree & Linear Regression were the fastest but least accurate.**  

---



## **Conclusion**
In this project, we developed a machine learning model to accurately predict used car prices for Rusty Bargain’s app. We performed data cleaning, feature engineering, and model evaluation using multiple algorithms.

After evaluating different models, LightGBM emerged as the best-performing model with an RMSE of 1630.34, surpassing Random Forest (1677.16 RMSE) and XGBoost (1754.45 RMSE). Surprisingly, XGBoost performed worse than expected, indicating that further tuning or feature optimization might be needed.

The final **LightGBM model** was **trained, validated, saved with `joblib`, and reloaded**; the reloaded model reproduced the validation RMSE of 1630.34. A single-row prediction test returned a **predicted car price of 1742.45 EUR**, which shows that the end-to-end prediction path works.

**Final Takeaways:**

- LightGBM is the best model in terms of accuracy (lowest RMSE).
- Random Forest performed well but had a high training time.
- XGBoost needs further tuning to improve performance.
- Linear Regression and Decision Tree train and predict quickly but are not accurate enough for this problem.

Moving forward, we recommend deploying the LightGBM model and periodically retraining it with updated data to adapt to market trends.

# Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [ ]  Code is error free
- [ ]  The cells with the code have been arranged in order of execution
- [ ]  The data has been downloaded and prepared
- [ ]  The models have been trained
- [ ]  The analysis of speed and quality of the models has been performed