Rusty Bargain, a used car sales service, is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model to determine the value.

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

## Data preparation

### Basic Prep

In [1]:
pip install lightgbm

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np
import lightgbm as lgb

import time

from sklearn.neighbors import KNeighborsRegressor

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

from sklearn.tree import DecisionTreeRegressor

from sklearn.linear_model import LinearRegression

from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

from sklearn.metrics import accuracy_score, mean_squared_error, r2_score

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV

from xgboost import XGBRegressor

data = pd.read_csv('/datasets/car_data.csv')

In [3]:
print(data.info())
print(data.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

### Data cleanup

#### Handling missing values (including dropping columns)

In [4]:
# Treat missing categorical values as their own "Unknown" category.
# Assigning the result back avoids chained in-place fills on single columns.
categorical_na_cols = ['VehicleType', 'Gearbox', 'FuelType', 'Model', 'NotRepaired']
data[categorical_na_cols] = data[categorical_na_cols].fillna('Unknown')

In [5]:
data.drop(['NumberOfPictures', 'DateCrawled', 'DateCreated', 'LastSeen', 'PostalCode'], axis=1, inplace=True)
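The fill-and-drop steps above can be sketched on a toy frame. The values below are made up for illustration and are not taken from the car data; only the column names mirror the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the car data (illustrative values only).
toy = pd.DataFrame({
    "VehicleType": ["sedan", np.nan, "small"],
    "Gearbox": ["manual", "auto", np.nan],
    "NumberOfPictures": [0, 0, 0],
    "Price": [1500, 3200, 900],
})

# Share of missing values per column, measured before any filling.
missing_share = toy.isnull().mean()

# Keep the rows and mark missing categories as "Unknown".
categorical_cols = ["VehicleType", "Gearbox"]
toy[categorical_cols] = toy[categorical_cols].fillna("Unknown")

# Drop a column that is not used as a feature.
toy = toy.drop(columns=["NumberOfPictures"])

print(missing_share)
print(toy)
```

Filling with a placeholder keeps every row, at the cost of letting the models treat "Unknown" as just another category.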

#### Filtering Unrealistic Values

In [6]:
data = data[(data['Price'] > 500) & (data['Price'] < 100000)]
data = data[(data['Power'] > 10) & (data['Power'] < 1000)]
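To see how much data a filter like this removes, it can help to count rows before and after. A minimal sketch on made-up rows, using the same strict bounds as the cell above:

```python
import pandas as pd

# Illustrative rows only; the bounds match the notebook's filter.
toy = pd.DataFrame({
    "Price": [0, 450, 1200, 8500, 250000],
    "Power": [75, 0, 90, 1500, 300],
})

price_ok = (toy["Price"] > 500) & (toy["Price"] < 100000)
power_ok = (toy["Power"] > 10) & (toy["Power"] < 1000)

# .copy() makes the filtered frame independent of the original.
filtered = toy[price_ok & power_ok].copy()
rows_removed = len(toy) - len(filtered)

print(f"kept {len(filtered)} of {len(toy)} rows, removed {rows_removed}")
```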

#### Feature/Target Separation

In [7]:
X = data.drop('Price', axis=1)
y = data['Price']

## Model training

### Basic Prep

In [8]:
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)

X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
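The two chained splits above produce a 60/20/20 train/validation/test partition. A short check on synthetic arrays (not the car data) confirms the proportions and that the three parts do not overlap:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows with unique labels so overlap is easy to check.
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# First split: 60% train, 40% held out.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
# Second split: the held-out 40% is halved into validation and test.
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```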

In [9]:
print(X_train.columns)

Index(['VehicleType', 'RegistrationYear', 'Gearbox', 'Power', 'Model',
       'Mileage', 'RegistrationMonth', 'FuelType', 'Brand', 'NotRepaired'],
      dtype='object')


### Feature Scaling

In [10]:
scaler = StandardScaler()
numerical_cols = ['Mileage', 'Power']

# Work on explicit copies so the assignments below do not trigger
# pandas' SettingWithCopyWarning.
X_train, X_val, X_test = X_train.copy(), X_val.copy(), X_test.copy()

# Fit the scaler on the training split only, then reuse it for validation and test.
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_val[numerical_cols] = scaler.transform(X_val[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])
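The key point of this step is that the scaler learns its mean and standard deviation from the training split only, and the other splits are transformed with those same statistics. A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

cols = ["Mileage", "Power"]

# Illustrative values only.
train = pd.DataFrame({"Mileage": [50000.0, 100000.0, 150000.0],
                      "Power": [60.0, 100.0, 140.0]})
val = pd.DataFrame({"Mileage": [100000.0, 200000.0],
                    "Power": [100.0, 180.0]})

scaler = StandardScaler()

# Copies keep the raw frames untouched.
train_scaled = train.copy()
val_scaled = val.copy()

# fit_transform learns mean/std from train; transform reuses them for val.
train_scaled[cols] = scaler.fit_transform(train[cols])
val_scaled[cols] = scaler.transform(val[cols])

print(train_scaled)
print(val_scaled)
```

The first validation row equals the training mean, so it maps to zero in both columns; the second row lies outside the training range and maps to a value above 2.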

### Encoding Variables

#### Linear (Ordinal) Encoding

In [11]:
ordinal_cols = ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired']
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

X_train_ordinal = X_train.copy()
X_val_ordinal = X_val.copy()
X_test_ordinal = X_test.copy()

X_train_ordinal[ordinal_cols] = encoder.fit_transform(X_train[ordinal_cols])
X_val_ordinal[ordinal_cols] = encoder.transform(X_val[ordinal_cols])
X_test_ordinal[ordinal_cols] = encoder.transform(X_test[ordinal_cols])
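The `handle_unknown='use_encoded_value'` setting matters when validation or test data contain a category the encoder never saw during fitting. A small sketch with a made-up brand list shows the effect:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative data: "tesla" never appears in the training split.
train = pd.DataFrame({"Brand": ["bmw", "opel", "mini", "opel"]})
val = pd.DataFrame({"Brand": ["opel", "tesla"]})

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Categories are numbered in sorted order: bmw=0, mini=1, opel=2.
train_codes = encoder.fit_transform(train)
# Unseen categories are mapped to the sentinel value -1 instead of raising.
val_codes = encoder.transform(val)

print(train_codes.ravel())
print(val_codes.ravel())
```

Note that the integer codes imply an order (bmw < mini < opel) that has no real meaning. Tree-based models can cope with that, which is why the notebook reserves one-hot encoding for linear regression.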

In [12]:
print(X_train_ordinal.head())

        VehicleType  RegistrationYear  Gearbox     Power  Model   Mileage  \
155902          6.0              1994      2.0 -1.436343   26.0  0.605222   
136703          6.0              2011      2.0 -0.022200   81.0 -2.363567   
311318          0.0              2017      1.0  0.492034   26.0 -2.633456   
319296          0.0              2017      2.0 -0.609895   74.0  0.605222   
1170            5.0              2003      2.0 -0.793550   43.0 -0.069503   

        RegistrationMonth  FuelType  Brand  NotRepaired  
155902                  8       7.0   24.0          1.0  
136703                 12       7.0   21.0          1.0  
311318                  3       7.0    2.0          0.0  
319296                  8       0.0   11.0          0.0  
1170                    7       3.0   24.0          1.0  


#### One-Hot-Encoding

In [13]:
X_train_onehot = pd.get_dummies(X_train, drop_first=True)
X_val_onehot = pd.get_dummies(X_val, drop_first=True)
X_test_onehot = pd.get_dummies(X_test, drop_first=True)

X_train_onehot, X_val_onehot = X_train_onehot.align(X_val_onehot, join='left', axis=1, fill_value=0)
X_train_onehot, X_test_onehot = X_train_onehot.align(X_test_onehot, join='left', axis=1, fill_value=0)
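One caveat with calling `get_dummies(drop_first=True)` on each split separately: the dropped "first" category is chosen per split, so it can differ when a split happens to lack a category. With splits as large as the ones here this is unlikely to matter, but the sketch below (toy data, not the car dataset) shows the failure mode and one possible guard, which is to drop the baseline on the training split only and reindex the other splits to the training columns.

```python
import pandas as pd

# Illustrative data: the validation split has no "bmw" rows.
train = pd.DataFrame({"Power": [60, 100, 140], "Brand": ["bmw", "opel", "mini"]})
val = pd.DataFrame({"Power": [90, 120], "Brand": ["opel", "tesla"]})

train_oh = pd.get_dummies(train, drop_first=True)  # baseline dropped: Brand_bmw

# Per-split drop_first: in val the dropped column is Brand_opel, so after
# alignment the "opel" row is encoded as all zeros, like the bmw baseline.
_, val_naive = train_oh.align(
    pd.get_dummies(val, drop_first=True), join="left", axis=1, fill_value=0
)

# Guard: encode val without dropping, then keep exactly the training columns.
val_oh = pd.get_dummies(val).reindex(columns=train_oh.columns, fill_value=0)

print(val_naive.astype(int))
print(val_oh.astype(int))
```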

In [14]:
print(X_train.head())

       VehicleType  RegistrationYear Gearbox     Power    Model   Mileage  \
155902       small              1994  manual -1.436343  Unknown  0.605222   
136703       small              2011  manual -0.022200   cooper -2.363567   
311318     Unknown              2017    auto  0.492034  Unknown -2.633456   
319296     Unknown              2017  manual -0.609895    civic  0.605222   
1170         sedan              2003  manual -0.793550    astra -0.069503   

        RegistrationMonth  FuelType  Brand NotRepaired  
155902                  8    petrol   opel          no  
136703                 12    petrol   mini          no  
311318                  3    petrol    bmw     Unknown  
319296                  8   Unknown  honda     Unknown  
1170                    7  gasoline   opel          no  


In [15]:
def calculate_rmse(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    return rmse
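As a quick sanity check of the metric, RMSE can be computed by hand on a few made-up prices. The numbers below are illustrative and unrelated to the model results:

```python
import numpy as np

# Illustrative true and predicted prices.
y_true = np.array([1000.0, 2000.0, 3000.0])
y_pred = np.array([1100.0, 1900.0, 3200.0])

errors = y_true - y_pred                 # [-100, 100, -200]
mse = np.mean(errors ** 2)               # (10000 + 10000 + 40000) / 3 = 20000
rmse = np.sqrt(mse)                      # about 141.42

print(f"MSE: {mse:.1f}, RMSE: {rmse:.2f}")
```

Because RMSE is in the same unit as the target, an RMSE of about 141 here means predictions are off by roughly 141 price units, with larger misses weighted more heavily.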

### Final Data Checks

In [16]:
print(X_train_ordinal.shape)
print(y_train.shape)

(170759, 10)
(170759,)


In [17]:
print(X_train_ordinal.isnull().sum())
print(y_train.isnull().sum())

VehicleType          0
RegistrationYear     0
Gearbox              0
Power                0
Model                0
Mileage              0
RegistrationMonth    0
FuelType             0
Brand                0
NotRepaired          0
dtype: int64
0


In [18]:
print(np.isinf(X_train_ordinal).sum())
print(np.isinf(y_train).sum())

VehicleType          0
RegistrationYear     0
Gearbox              0
Power                0
Model                0
Mileage              0
RegistrationMonth    0
FuelType             0
Brand                0
NotRepaired          0
dtype: int64
0


In [19]:
print(X_train_ordinal.head())
print(y_train.head())

        VehicleType  RegistrationYear  Gearbox     Power  Model   Mileage  \
155902          6.0              1994      2.0 -1.436343   26.0  0.605222   
136703          6.0              2011      2.0 -0.022200   81.0 -2.363567   
311318          0.0              2017      1.0  0.492034   26.0 -2.633456   
319296          0.0              2017      2.0 -0.609895   74.0  0.605222   
1170            5.0              2003      2.0 -0.793550   43.0 -0.069503   

        RegistrationMonth  FuelType  Brand  NotRepaired  
155902                  8       7.0   24.0          1.0  
136703                 12       7.0   21.0          1.0  
311318                  3       7.0    2.0          0.0  
319296                  8       0.0   11.0          0.0  
1170                    7       3.0   24.0          1.0  
155902      650
136703    14999
311318     1800
319296      600
1170       3800
Name: Price, dtype: int64


In [20]:
assert len(X_train_ordinal) == len(y_train), "Features and labels are misaligned!"

### Linear Regression

In [21]:
model = LinearRegression()

model.fit(X_train_onehot, y_train)

predictions = model.predict(X_val_onehot)
mse = mean_squared_error(y_val, predictions)
rmse = calculate_rmse(y_val, predictions)
r2 = r2_score(y_val, predictions)

print(f"Root Mean Squared Error: {rmse}")
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

Root Mean Squared Error: 2851.3620797412527
Mean Squared Error: 8130265.709786362
R-squared: 0.6109900712754263


### Random Forest (w/ hyperparameter tuning)

In [22]:
rf_param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10],
}

rf = RandomForestRegressor() 

random_search = RandomizedSearchCV(
    estimator=rf, 
    param_distributions=rf_param_grid,
    n_iter=5, 
    cv=2, 
    n_jobs=-1, 
    scoring='neg_root_mean_squared_error'
)

random_search.fit(X_train_ordinal, y_train)
best_rf = random_search.best_estimator_

predictions = best_rf.predict(X_val_ordinal)
rmse = calculate_rmse(y_val, predictions)
mse = mean_squared_error(y_val, predictions)
r2 = r2_score(y_val, predictions)

print(f"Root Mean Squared Error: {rmse}")
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
print(f"Best parameters: {random_search.best_params_}")



Root Mean Squared Error: 1650.900851620995
Mean Squared Error: 2725473.621882926
R-squared: 0.8695938930860302
Best parameters: {'n_estimators': 100, 'max_depth': None}
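One detail worth noting about the search above: the grid contains only four combinations, so `n_iter=5` cannot sample five distinct candidates. When every parameter is given as a list, scikit-learn caps the number of candidates at the grid size. The combinations can be counted with `ParameterGrid`:

```python
from sklearn.model_selection import ParameterGrid

rf_param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
}

n_iter = 5
n_combinations = len(ParameterGrid(rf_param_grid))

# With list-only parameters, the randomized search samples without
# replacement, so it evaluates at most the full grid.
n_evaluated = min(n_iter, n_combinations)

print(n_combinations, n_evaluated)  # 4 4
```

In this setting the randomized search behaves like an exhaustive grid search over the four candidates.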


### Decision Tree Regression (w/ hyperparameter tuning)

In [23]:
dt_param_grid = {
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 10],
}

dt = DecisionTreeRegressor()

random_search_dt = RandomizedSearchCV(
    estimator=dt,
    param_distributions=dt_param_grid,
    n_iter=5,
    cv=2,
    n_jobs=-1,
    scoring='neg_root_mean_squared_error'
)

random_search_dt.fit(X_train_ordinal, y_train)
best_dt = random_search_dt.best_estimator_

predictions = best_dt.predict(X_val_ordinal)
mse = mean_squared_error(y_val, predictions)
rmse = calculate_rmse(y_val, predictions)
r2 = r2_score(y_val, predictions)

print(f"Root Mean Squared Error: {rmse}")
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
print(f"Best parameters: {random_search_dt.best_params_}")

Root Mean Squared Error: 1944.9050277244278
Mean Squared Error: 3782655.5668677576
R-squared: 0.8190107648406113
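Since each model section repeats the same predict-and-report steps, a small helper can keep the reported parameters tied to the search object that produced them. This is a sketch on synthetic data; `summarize_search` is a hypothetical name, not part of the notebook or of scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor


def summarize_search(search, X_val, y_val):
    """Collect validation metrics and best params from one fitted search."""
    predictions = search.best_estimator_.predict(X_val)
    mse = mean_squared_error(y_val, predictions)
    return {
        "rmse": float(np.sqrt(mse)),
        "mse": float(mse),
        "r2": float(r2_score(y_val, predictions)),
        "best_params": search.best_params_,
    }


# Synthetic regression problem, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3 * X[:, 0] + rng.normal(0, 0.5, size=200)

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": [2, 5, None], "min_samples_split": [2, 10]},
    n_iter=4,
    cv=2,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X[:150], y[:150])

summary = summarize_search(search, X[150:], y[150:])
print(summary)
```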


### Gradient Boosting (w/ hyperparameter tuning)

In [24]:
gb_param_grid = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5],
}

gb = GradientBoostingRegressor()

random_search_gb = RandomizedSearchCV(
    estimator=gb,
    param_distributions=gb_param_grid,
    n_iter=5,
    cv=2,
    n_jobs=-1,
    scoring='neg_root_mean_squared_error'
)

random_search_gb.fit(X_train_ordinal, y_train)
best_gb = random_search_gb.best_estimator_

predictions = best_gb.predict(X_val_ordinal)
mse = mean_squared_error(y_val, predictions)
rmse = calculate_rmse(y_val, predictions)
r2 = r2_score(y_val, predictions)

print(f"Root Mean Squared Error: {rmse}")
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
print(f"Best parameters: {random_search_gb.best_params_}")

Root Mean Squared Error: 1783.7915561887485
Mean Squared Error: 3181912.3159302766
R-squared: 0.8477546088391734


### XGBoost (w/ hyperparameter tuning)

In [25]:
xgb_param_grid = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5],
}

xgb = XGBRegressor()

random_search_xgb = RandomizedSearchCV(
    estimator=xgb,
    param_distributions=xgb_param_grid,
    n_iter=5,
    cv=2,
    n_jobs=-1,
    scoring='neg_root_mean_squared_error'
)

random_search_xgb.fit(
    X_train_ordinal, y_train,
    eval_set=[(X_val_ordinal, y_val)],
    eval_metric='rmse',
    early_stopping_rounds=10,
    verbose=False
)

best_xgb = random_search_xgb.best_estimator_

predictions = best_xgb.predict(X_val_ordinal)
mse = mean_squared_error(y_val, predictions)
rmse = calculate_rmse(y_val, predictions)
r2 = r2_score(y_val, predictions)

print(f"Root Mean Squared Error: {rmse}")
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
print(f"Best parameters: {random_search_xgb.best_params_}")

Root Mean Squared Error: 1786.2022965730473
Mean Squared Error: 3190518.6442828286
R-squared: 0.8473428206764597


### K-Nearest Neighbors (w/ hyperparameter tuning)

In [26]:
knn_param_grid = {
    'n_neighbors': [3, 5, 7, 9],
}

knn = KNeighborsRegressor()

random_search_knn = RandomizedSearchCV(
    estimator=knn,
    param_distributions=knn_param_grid,
    n_iter=5,
    cv=2,
    n_jobs=-1,
    scoring='neg_root_mean_squared_error'
)

random_search_knn.fit(X_train_ordinal, y_train)
best_knn = random_search_knn.best_estimator_

predictions = best_knn.predict(X_val_ordinal)
mse = mean_squared_error(y_val, predictions)
rmse = calculate_rmse(y_val, predictions)
r2 = r2_score(y_val, predictions)

print(f"Root Mean Squared Error: {rmse}")
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
print(f"Best parameters: {random_search_knn.best_params_}")

Root Mean Squared Error: 1930.5439042971893
Mean Squared Error: 3726999.766419035
R-squared: 0.8216737354910791


### LightGBM (w/ hyperparameter tuning)

In [27]:
lgb_param_grid = {
    'n_estimators': [50, 100],
    #'num_leaves': [31, 40, 50],
    'learning_rate': [0.1, 0.3],
    #'boosting_type': ['gbdt', 'dart'],
}

lgbm = lgb.LGBMRegressor()

random_search_lgb = RandomizedSearchCV(
    estimator=lgbm,
    param_distributions=lgb_param_grid,
    n_iter=5,
    cv=2,
    n_jobs=-1,
    scoring='neg_root_mean_squared_error'
)

random_search_lgb.fit(
    X_train_ordinal,
    y_train,
    eval_set=[(X_val_ordinal, y_val)],
    eval_metric='rmse',
    early_stopping_rounds=10,
    verbose=False
)

best_lgb = random_search_lgb.best_estimator_

predictions = best_lgb.predict(X_val_ordinal)
mse = mean_squared_error(y_val, predictions)
rmse = calculate_rmse(y_val, predictions)
r2 = r2_score(y_val, predictions)

print(f"Root Mean Squared Error: {rmse}")
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
print(f"Best parameters: {random_search_lgb.best_params_}")

Root Mean Squared Error: 1657.90772000534
Mean Squared Error: 2748658.0080533046
R-squared: 0.8684845866090222


## Model analysis

In [28]:
def evaluate_model(model, X_train, X_test, y_train, y_test):
    start_time = time.time()
    model.fit(X_train, y_train)
    training_time = time.time() - start_time

    start_time = time.time()
    predictions = model.predict(X_test)
    prediction_time = time.time() - start_time

    mse = mean_squared_error(y_test, predictions)
    rmse = np.sqrt(mse)

    return rmse, training_time, prediction_time
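The same timing pattern can be written with `time.perf_counter()`, a monotonic high-resolution clock that is generally better suited to measuring short intervals than `time.time()`. A sketch on synthetic data, where `timed` is a hypothetical helper rather than part of the notebook:

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Synthetic linear data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

model = LinearRegression()
_, fit_seconds = timed(model.fit, X[:800], y[:800])
predictions, predict_seconds = timed(model.predict, X[800:])

rmse = float(np.sqrt(np.mean((y[800:] - predictions) ** 2)))
print(f"RMSE={rmse:.3f}, fit={fit_seconds:.4f}s, predict={predict_seconds:.4f}s")
```

Single-run timings like these vary from run to run, so small differences between models should be read with some caution.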

In [29]:
models = [
    LinearRegression(),
    RandomForestRegressor(n_estimators=10),
    DecisionTreeRegressor(),
    GradientBoostingRegressor(),
    XGBRegressor(),
    KNeighborsRegressor(),
    lgb.LGBMRegressor(n_estimators=1000)
]

results = []

In [30]:
for model in models:
    if isinstance(model, LinearRegression):
        X_train_final, X_test_final = X_train_onehot, X_test_onehot
    else:
        X_train_final, X_test_final = X_train_ordinal, X_test_ordinal
    
    rmse, training_time, prediction_time = evaluate_model(model, X_train_final, X_test_final, y_train, y_test)
    results.append({
        'Model': model.__class__.__name__,
        'RMSE': rmse,
        'Training Time (s)': training_time,
        'Prediction Time (s)': prediction_time
    })

results_df = pd.DataFrame(results)

print(results_df)

                       Model         RMSE  Training Time (s)  \
0           LinearRegression  2849.263617           6.281204   
1      RandomForestRegressor  1709.786482           4.148941   
2      DecisionTreeRegressor  2148.783089           0.615896   
3  GradientBoostingRegressor  1960.678478          11.013670   
4               XGBRegressor  1660.410448          18.045146   
5        KNeighborsRegressor  1921.494704           0.778044   
6              LGBMRegressor  1579.973464          11.605951   

   Prediction Time (s)  
0             0.108725  
1             0.175200  
2             0.019575  
3             0.072594  
4             0.104393  
5             1.415849  
6             2.799423  
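To make the comparison easier to read, the table can be ranked by RMSE and the fastest models picked out. The sketch below rebuilds the frame from the values printed above rather than rerunning the models:

```python
import pandas as pd

# Values copied from the results table printed above.
results_df = pd.DataFrame({
    "Model": ["LinearRegression", "RandomForestRegressor", "DecisionTreeRegressor",
              "GradientBoostingRegressor", "XGBRegressor", "KNeighborsRegressor",
              "LGBMRegressor"],
    "RMSE": [2849.263617, 1709.786482, 2148.783089, 1960.678478,
             1660.410448, 1921.494704, 1579.973464],
    "Training Time (s)": [6.281204, 4.148941, 0.615896, 11.013670,
                          18.045146, 0.778044, 11.605951],
    "Prediction Time (s)": [0.108725, 0.175200, 0.019575, 0.072594,
                            0.104393, 1.415849, 2.799423],
})

# Lower RMSE is better, so ascending order puts the most accurate model first.
ranked = results_df.sort_values("RMSE").reset_index(drop=True)

fastest_train = results_df.loc[results_df["Training Time (s)"].idxmin(), "Model"]
fastest_predict = results_df.loc[results_df["Prediction Time (s)"].idxmin(), "Model"]
rmse_gap = ranked.loc[1, "RMSE"] - ranked.loc[0, "RMSE"]

print(ranked)
print(f"Fastest training: {fastest_train}; fastest prediction: {fastest_predict}")
print(f"RMSE gap between the two best models: {rmse_gap:.1f}")
```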


## Conclusions

Seven models were trained on the data and compared on accuracy (RMSE on the test set), training time, and prediction time. The main takeaways:

* In terms of speed, the Decision Tree was the fastest model overall, training in 0.62 seconds and predicting in 0.02 seconds. Among the ensemble models, Random Forest (with 10 trees) had the fastest training time at 4.15 seconds, while XGBoost predicted faster (0.104 seconds versus Random Forest's 0.175 seconds) but took the longest to train at 18.05 seconds. LightGBM (with 1,000 trees) had the slowest prediction time at 2.80 seconds.
* In terms of accuracy, the LightGBM model was the most accurate, with a Root Mean Squared Error of around 1,580, about 80 points below the second-best performer, XGBoost, at 1,660. Linear Regression performed worst with an RMSE of 2,849, followed by the Decision Tree at 2,149.

What the company chooses for its app will come down to whether it prioritizes speed or accuracy, or takes a balanced approach. Random Forest is a reasonable model for a balanced approach: its RMSE of about 1,710 is the third best, and it trains and predicts faster than LightGBM.
