# Introduction
Rusty Bargain, a used car sales service, is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model that predicts a car's market value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.
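
The three criteria above can be measured separately for any model that exposes `fit` and `predict`. The following is a minimal sketch under that assumption; `evaluate_model` and `MeanRegressor` are hypothetical names that do not appear elsewhere in this notebook, and the toy prices are made up for illustration.

```python
import time

import numpy as np


def evaluate_model(model, X_train, y_train, X_valid, y_valid):
    """Time fit and predict separately and report the validation RMSE."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_valid)
    predict_time = time.perf_counter() - start

    errors = np.asarray(y_valid) - np.asarray(predictions)
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    return {"train_time_s": train_time, "predict_time_s": predict_time, "rmse": rmse}


class MeanRegressor:
    """Hypothetical stand-in model: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


# Toy data (made up): the features are ignored by MeanRegressor.
X_train = np.zeros((4, 1))
y_train = np.array([1000.0, 2000.0, 3000.0, 4000.0])
X_valid = np.zeros((2, 1))
y_valid = np.array([2000.0, 3000.0])

report = evaluate_model(MeanRegressor(), X_train, y_train, X_valid, y_valid)
print(report)
```

Any scikit-learn style regressor could be passed in place of `MeanRegressor`, which would give one comparable record per model.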

## Libraries

In [1]:
# Import libraries required for analysis 
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from time import process_time 
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from xgboost import XGBRegressor #optional

## Data preparation

In [2]:
# Load data
car_data = pd.read_csv('/datasets/car_data.csv')

print(car_data.info())
print('-'*40)
print(car_data.head())
print('-'*40)
print(car_data.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

In [3]:
# Check for missing values
print(car_data.isnull().sum())
print('-'*40)
print(car_data.duplicated().sum())

DateCrawled              0
Price                    0
VehicleType          37490
RegistrationYear         0
Gearbox              19833
Power                    0
Model                19705
Mileage                  0
RegistrationMonth        0
FuelType             32895
Brand                    0
NotRepaired          71154
DateCreated              0
NumberOfPictures         0
PostalCode               0
LastSeen                 0
dtype: int64
----------------------------------------
262


In [4]:
# Keep only one of the duplicated rows
car_data.drop_duplicates(keep = 'first', inplace = True)
car_data.duplicated().sum()

0

In [5]:
# percentage of missing values 
print((car_data.isnull().sum()/len(car_data))*100)
# Replace missing values with Unknown
car_data.fillna('Unknown', inplace=True)
print('-'*40)
# confirm missing values replaced
print(car_data.isna().sum())

DateCrawled           0.000000
Price                 0.000000
VehicleType          10.585501
RegistrationYear      0.000000
Gearbox               5.600002
Power                 0.000000
Model                 5.563573
Mileage               0.000000
RegistrationMonth     0.000000
FuelType              9.287871
Brand                 0.000000
NotRepaired          20.091385
DateCreated           0.000000
NumberOfPictures      0.000000
PostalCode            0.000000
LastSeen              0.000000
dtype: float64
----------------------------------------
DateCrawled          0
Price                0
VehicleType          0
RegistrationYear     0
Gearbox              0
Power                0
Model                0
Mileage              0
RegistrationMonth    0
FuelType             0
Brand                0
NotRepaired          0
DateCreated          0
NumberOfPictures     0
PostalCode           0
LastSeen             0
dtype: int64


In [6]:
# Drop columns that are not used as features
car_data.drop(['DateCrawled', 'RegistrationMonth', 'DateCreated', 'PostalCode', 'NumberOfPictures', 'LastSeen'], axis=1, inplace = True)

# Check if columns were dropped
car_data.columns

Index(['Price', 'VehicleType', 'RegistrationYear', 'Gearbox', 'Power', 'Model',
       'Mileage', 'FuelType', 'Brand', 'NotRepaired'],
      dtype='object')

### Conclusion

After inspecting the data, I found 262 duplicated rows and kept only the first occurrence of each. Missing values occur only in the categorical columns (VehicleType, Gearbox, Model, FuelType, NotRepaired), so I replaced them with 'Unknown'.

I also dropped the columns that should not help the model determine the value of a car: DateCrawled, RegistrationMonth, DateCreated, PostalCode, NumberOfPictures, and LastSeen.
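
The notebook calls `fillna('Unknown')` on the whole frame, which works here because only categorical columns contain missing values. As a hedged alternative, the sketch below restricts the fill to non-numeric columns, so a numeric column with gaps would not silently turn into mixed text and numbers. `fill_categorical_missing` is a hypothetical helper and the toy frame is made up.

```python
import pandas as pd


def fill_categorical_missing(df, placeholder="Unknown"):
    """Return a copy with missing values filled in non-numeric columns only."""
    out = df.copy()
    categorical = out.select_dtypes(exclude="number").columns
    out[categorical] = out[categorical].fillna(placeholder)
    return out


# Toy frame (made up) with one gap in a categorical column.
toy = pd.DataFrame(
    {
        "Price": [480, 18300, 9800],
        "Gearbox": ["manual", None, "auto"],
        "Power": [0, 190, 163],
    }
)

filled = fill_categorical_missing(toy)
print(filled)
```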

## Model training

### OHE categorical columns 

In [7]:
# Categorical columns list
categorical_columns = ['VehicleType','Gearbox','Model','FuelType','Brand','NotRepaired']

# Convert categorical columns to numerical using OHE
car_data_ohe = pd.get_dummies(car_data, columns=categorical_columns, drop_first=False)
car_data_ohe.head()


Unnamed: 0,Price,RegistrationYear,Power,Mileage,VehicleType_Unknown,VehicleType_bus,VehicleType_convertible,VehicleType_coupe,VehicleType_other,VehicleType_sedan,...,Brand_sonstige_autos,Brand_subaru,Brand_suzuki,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,NotRepaired_Unknown,NotRepaired_no,NotRepaired_yes
0,480,1993,0,150000,1,0,0,0,0,0,...,0,0,0,0,0,1,0,1,0,0
1,18300,2011,190,125000,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
2,9800,2004,163,125000,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,1500,2001,75,150000,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
4,3600,2008,69,90000,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
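
To make the encoding step concrete, here is a small sketch of what `pd.get_dummies` does to a single categorical column. The toy frame is made up. With `drop_first=False`, as used above, every category gets its own column and exactly one of them is set per row; `drop_first=True` removes the first category, which is a common choice for linear models because the remaining columns are no longer redundant.

```python
import pandas as pd

# Toy frame (made up) with one categorical column.
toy = pd.DataFrame(
    {
        "Price": [480, 18300, 9800],
        "Gearbox": ["Unknown", "auto", "manual"],
    }
)

# One indicator column per category, as in the notebook.
full = pd.get_dummies(toy, columns=["Gearbox"], drop_first=False)

# Same encoding with the first category dropped.
reduced = pd.get_dummies(toy, columns=["Gearbox"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```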


### Splitting data

In [8]:
features = car_data_ohe.drop('Price',axis=1)
target = car_data_ohe['Price']
features_train, features_test_valid, target_train, target_test_valid = train_test_split(features, target, test_size=0.30, random_state=12345)
features_test, features_valid, target_test, target_valid = train_test_split(features_test_valid, target_test_valid, test_size=0.50, random_state=12345)

print(features_train.shape)  
print(target_train.shape)  

# Validation
print(features_valid.shape) 
print(target_valid.shape)   

# Test
print(features_test.shape)   
print(target_test.shape)


(247874, 317)
(247874,)
(53117, 317)
(53117,)
(53116, 317)
(53116,)
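
The shapes above correspond to a 70/15/15 split. The arithmetic can be reproduced without the data, assuming the held-out share is rounded up and the remainder goes to the other part, which matches the printed sizes. `split_sizes` is a hypothetical helper written for this check.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (remaining, held_out) sizes for a fractional test_size."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out


# Rows left after dropping the 262 duplicates found earlier.
n_rows = 354369 - 262

# First split: 70% training, 30% held out.
n_train, n_holdout = split_sizes(n_rows, 0.30)

# Second split: the held-out part is halved; in the notebook the first half
# is assigned to the test set and the second half to the validation set.
n_test, n_valid = split_sizes(n_holdout, 0.50)

print(n_train, n_valid, n_test)
```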


### Scaling columns

In [9]:

# Standardize the numeric columns so that features on larger scales do not dominate scale-sensitive models.
numeric = ['RegistrationYear','Power','Mileage']
scaler = StandardScaler()
scaler.fit(features_train[numeric])

# Create explicit copies to avoid warnings
features_train = features_train.copy()
features_valid = features_valid.copy()
features_test = features_test.copy()

features_train.loc[:, numeric] = scaler.transform(features_train[numeric])
features_valid.loc[:, numeric] = scaler.transform(features_valid[numeric])
features_test.loc[:, numeric] = scaler.transform(features_test[numeric])
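
What the scaler does can be written out by hand. The sketch below is an illustration in plain NumPy, not the notebook's code: the mean and standard deviation are computed on the training rows only and then reused for the validation rows, so no information from the held-out data leaks into the transformation. The toy values are the RegistrationYear and Power entries from the first rows of the encoded preview.

```python
import numpy as np

# Toy training rows: [RegistrationYear, Power].
train = np.array(
    [
        [1993.0, 0.0],
        [2011.0, 190.0],
        [2004.0, 163.0],
        [2001.0, 75.0],
    ]
)
# Toy validation row, scaled with the training statistics.
valid = np.array([[2008.0, 69.0]])

mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
valid_scaled = (valid - mean) / std

print(train_scaled.round(3))
print(valid_scaled.round(3))
```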

In [10]:
display(features_train.head())
display(features_valid.head())
display(features_test.head())

Unnamed: 0,RegistrationYear,Power,Mileage,VehicleType_Unknown,VehicleType_bus,VehicleType_convertible,VehicleType_coupe,VehicleType_other,VehicleType_sedan,VehicleType_small,...,Brand_sonstige_autos,Brand_subaru,Brand_suzuki,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,NotRepaired_Unknown,NotRepaired_no,NotRepaired_yes
335537,0.008761,-0.049265,0.574261,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
312714,0.031186,0.163029,0.574261,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
60759,0.154525,-0.272447,0.574261,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
347328,-0.08094,0.081378,0.574261,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0
11165,0.008761,0.005169,0.574261,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


Unnamed: 0,RegistrationYear,Power,Mileage,VehicleType_Unknown,VehicleType_bus,VehicleType_convertible,VehicleType_coupe,VehicleType_other,VehicleType_sedan,VehicleType_small,...,Brand_sonstige_autos,Brand_subaru,Brand_suzuki,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,NotRepaired_Unknown,NotRepaired_no,NotRepaired_yes
124153,-0.03609,-0.599054,0.574261,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0
314901,-0.002452,-0.599054,-0.086719,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
326837,-0.047303,-0.190795,0.574261,0,0,1,0,0,0,0,...,0,0,0,0,0,1,0,1,0,0
188686,-0.047303,-0.272447,0.574261,1,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
334649,0.087249,-0.027491,0.574261,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0


Unnamed: 0,RegistrationYear,Power,Mileage,VehicleType_Unknown,VehicleType_bus,VehicleType_convertible,VehicleType_coupe,VehicleType_other,VehicleType_sedan,VehicleType_small,...,Brand_sonstige_autos,Brand_subaru,Brand_suzuki,Brand_toyota,Brand_trabant,Brand_volkswagen,Brand_volvo,NotRepaired_Unknown,NotRepaired_no,NotRepaired_yes
165355,0.008761,1.061198,-0.7477,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
216348,0.008761,-0.190795,-2.598444,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
77596,-0.047303,0.081378,0.574261,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0
152824,-0.024877,-0.326881,0.574261,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
68173,-0.170641,-0.599054,0.574261,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0


## Model analysis

### Baseline linear regression model

In [11]:
%%time 

# Initialize Model
model = LinearRegression()

# Fit Model to Training Data
model.fit(features_train, target_train)

# Predict on the validation set
predicted_values = model.predict(features_valid)

# Calculate RMSE
RMSE = np.sqrt(mean_squared_error(target_valid, predicted_values))

# Print RMSE Score
print('The RMSE for the LinearRegression Model is:', round(RMSE,2))

# Print Time Elapsed For Model Runtime
print('Run Time for LinearRegression:')

The RMSE for the LinearRegression Model is: 3186.78
Run Time for LinearRegression:
CPU times: user 11.8 s, sys: 4.23 s, total: 16 s
Wall time: 8.44 s
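
For reference, the RMSE computed above is the square root of the mean squared error. In the notation introduced here (not used elsewhere in the notebook), $y_i$ is the observed price, $\hat{y}_i$ the predicted price, and $n$ the number of validation rows:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

The result is in the same units as Price, and because the errors are squared before averaging, large misses weigh more heavily than they would in a mean absolute error.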


### Random Forest Model

In [12]:
%%time
rf = RandomForestRegressor(random_state=12345)

rf_params = {
    'n_estimators': [5, 10, 20],
    'max_depth': [5, 10, 20]
}

rf_grid = GridSearchCV(
    rf,
    param_grid=rf_params,
    cv=3,
    scoring='neg_root_mean_squared_error',
    verbose=1
)

rf_grid.fit(features_train, target_train)

rf_best = rf_grid.best_estimator_
rf_pred = rf_best.predict(features_valid)
rf_rmse = np.sqrt(mean_squared_error(target_valid, rf_pred))

print("RandomForest Best Params:", rf_grid.best_params_)
# best_score_ is the negated mean cross-validated RMSE, so flip the sign
print("RandomForest Best CV RMSE:", round(-rf_grid.best_score_, 2))
print("RandomForest Validation RMSE:", round(rf_rmse, 2))

Fitting 3 folds for each of 9 candidates, totalling 27 fits
RandomForest Best Params: {'max_depth': 20, 'n_estimators': 20}
RandomForest Best CV RMSE: 1807.14
RandomForest Validation RMSE: 1787.16
CPU times: user 7min 47s, sys: 3.46 s, total: 7min 51s
Wall time: 7min 51s
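
The "27 fits" in the log follow directly from the grid: every combination of the listed values is one candidate, and each candidate is fitted once per fold. A minimal sketch of that enumeration in plain Python (the variable names are illustrative):

```python
from itertools import product

rf_params = {
    "n_estimators": [5, 10, 20],
    "max_depth": [5, 10, 20],
}
cv_folds = 3

# Every combination of parameter values is one candidate.
names = list(rf_params)
candidates = [dict(zip(names, values)) for values in product(*rf_params.values())]

# Each candidate is trained once per cross-validation fold.
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)
```

With the default `refit=True`, `GridSearchCV` then trains the best candidate once more on the full training set, so the grid size has a direct effect on the wall time reported for the cell.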


### LightGBM model

In [13]:
%%time

from lightgbm import LGBMRegressor

lgb = LGBMRegressor(random_state=12345)

# Also try unlimited depth (max_depth=-1)
lgb_params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, -1]
}

lgb_grid = GridSearchCV(
    lgb,
    param_grid=lgb_params,
    cv=3,
    scoring='neg_root_mean_squared_error',
    verbose=1
)

lgb_grid.fit(features_train, target_train)

lgb_best = lgb_grid.best_estimator_
lgb_pred = lgb_best.predict(features_valid)
lgb_rmse = np.sqrt(mean_squared_error(target_valid, lgb_pred))

print("LightGBM Best Params:", lgb_grid.best_params_)
# best_score_ is the negated mean cross-validated RMSE, so flip the sign
print("LightGBM Best CV RMSE:", round(-lgb_grid.best_score_, 2))
print("LightGBM Validation RMSE:", round(lgb_rmse, 2))

Fitting 3 folds for each of 9 candidates, totalling 27 fits
LightGBM Best Params: {'max_depth': -1, 'n_estimators': 200}
LightGBM Best CV RMSE: 1794.82
LightGBM Validation RMSE: 1803.45
CPU times: user 2min 29s, sys: 5.27 s, total: 2min 35s
Wall time: 1min 25s


### CatBoost model

In [14]:

%%time
# Initialize CatBoost with silent logging
cat = CatBoostRegressor(
    loss_function='RMSE',
    logging_level='Silent',
    random_state=12345
)

# Parameter grid
cat_params = {
    'depth': [4, 6, 8],
    'iterations': [200, 400]
}

# Grid search over the parameter grid with 3-fold cross-validation
cat_grid = cat.grid_search(
    cat_params,
    X=features_train,
    y=target_train,
    cv=3,
    partition_random_seed=42,
    verbose=False  # This should be enough to suppress output
)


# Extract best parameters
best_params = cat_grid['params']

# Train CatBoost with best parameters
cat_best = CatBoostRegressor(
    **best_params,
    loss_function='RMSE',
    logging_level='Silent',
    random_state=12345
)
cat_best.fit(features_train, target_train)

# Predict and calculate RMSE
cat_pred = cat_best.predict(features_valid)
cat_rmse = np.sqrt(mean_squared_error(target_valid, cat_pred))

# Print only best parameters and RMSE
print("CatBoost Best Parameters:")
for k, v in best_params.items():
    print(f"  {k}: {v}")
print(f"CatBoost Validation RMSE: {cat_rmse:.2f}")

CatBoost Best Parameters:
  depth: 8
  iterations: 400
CatBoost Validation RMSE: 1741.83
CPU times: user 3min 26s, sys: 19 s, total: 3min 45s
Wall time: 1min 53s


In [15]:
print("\n--- RMSE Comparison ---")
print("Linear Regression:", round(RMSE, 2))
print("RandomForest:", round(rf_rmse, 2))
print("LightGBM:", round(lgb_rmse, 2))
print("CatBoost:", round(cat_rmse, 2))


--- RMSE Comparison ---
Linear Regression: 3186.78
RandomForest: 1787.16
LightGBM: 1803.45
CatBoost: 1741.83


### Conclusion

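All three tuned models clearly outperformed the baseline linear regression (validation RMSE 3186.78). CatBoost achieved the lowest validation RMSE (1741.83), followed by random forest (1787.16) and LightGBM (1803.45). Random forest was significantly slower: its cell took 7min 51s of wall time, compared with 1min 53s for CatBoost and 1min 25s for LightGBM. These timings cover the whole cell (grid search, final fit, and prediction), so training time and prediction speed are not separated.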
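
To view quality and run time side by side, the figures printed in the cells above can be collected in one place. This is a small sketch in plain Python; the wall times are copied from the `%%time` outputs and include the grid search where one was run.

```python
# Validation RMSE and wall time (seconds) copied from the cell outputs above.
results = {
    "LinearRegression": {"rmse": 3186.78, "wall_s": 8.44},
    "RandomForest": {"rmse": 1787.16, "wall_s": 7 * 60 + 51},
    "LightGBM": {"rmse": 1803.45, "wall_s": 1 * 60 + 25},
    "CatBoost": {"rmse": 1741.83, "wall_s": 1 * 60 + 53},
}

# Rank models from lowest to highest validation RMSE.
ranked = sorted(results.items(), key=lambda item: item[1]["rmse"])

for name, metrics in ranked:
    print(f"{name:<17} RMSE {metrics['rmse']:>8.2f}   wall time {metrics['wall_s']:>7.2f} s")

best_model = ranked[0][0]
```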

# Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [ ]  Code is error free
- [ ]  The cells with the code have been arranged in order of execution
- [ ]  The data has been downloaded and prepared
- [ ]  The models have been trained
- [ ]  The analysis of speed and quality of the models has been performed