Hello Elvis!

I’m happy to review your project today.
I will mark your mistakes and give you some hints how it is possible to fix them. We are getting ready for real job, where your team leader/senior colleague will do exactly the same. Don't worry and study with pleasure!

Below you will find my comments - **please do not move, modify or delete them**.

You can find my comments in green, yellow or red boxes like this:

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Success. Everything is done succesfully.
</div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Remarks. Some recommendations.
</div>

<div class="alert alert-block alert-danger">

<b>Reviewer's comment</b> <a class="tocSkip"></a>

Needs fixing. The block requires some corrections. Work can't be accepted with the red comments.
</div>

You can answer me by using this:

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>

Text here.
</div>

# SPRINT 12: NUMERICAL METHODS

# PROJECT  TITLE :
# <h> **Estimating Used Car Value for Rusty Bargain's Customer App**</h>

Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training

## Data preparation

In [1]:
# Importing important libraries 
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import lightgbm as lgb
from sklearn.metrics import mean_squared_error
import time
from sklearn.tree import DecisionTreeRegressor

In [2]:
# loading the dataset
df = pd.read_csv('/datasets/car_data.csv')
df.head()

Unnamed: 0,DateCrawled,Price,VehicleType,RegistrationYear,Gearbox,Power,Model,Mileage,RegistrationMonth,FuelType,Brand,NotRepaired,DateCreated,NumberOfPictures,PostalCode,LastSeen
0,24/03/2016 11:52,480,,1993,manual,0,golf,150000,0,petrol,volkswagen,,24/03/2016 00:00,0,70435,07/04/2016 03:16
1,24/03/2016 10:58,18300,coupe,2011,manual,190,,125000,5,gasoline,audi,yes,24/03/2016 00:00,0,66954,07/04/2016 01:46
2,14/03/2016 12:52,9800,suv,2004,auto,163,grand,125000,8,gasoline,jeep,,14/03/2016 00:00,0,90480,05/04/2016 12:47
3,17/03/2016 16:54,1500,small,2001,manual,75,golf,150000,6,petrol,volkswagen,no,17/03/2016 00:00,0,91074,17/03/2016 17:40
4,31/03/2016 17:25,3600,small,2008,manual,69,fabia,90000,7,gasoline,skoda,no,31/03/2016 00:00,0,60437,06/04/2016 10:17


In [3]:
print(df.info())
print()
print(df.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64 
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64 
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64 
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64 
 8   RegistrationMonth  354369 non-null  int64 
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64 
 14  PostalCode         354369 non-null  int64 
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(

In [4]:
# Handle missing values (simple imputation with the most frequent value)
for col in ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'NotRepaired']:
    df[col].fillna(df[col].mode()[0], inplace=True)
# Drop unnecessary columns
df = df.drop(['DateCrawled', 'DateCreated', 'NumberOfPictures', 'PostalCode', 'LastSeen'], axis=1)


In [5]:
# Prepare the target variable
y = df['Price']
X = df.drop('Price', axis=1)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\nShape of training features:", X_train.shape)
print("Shape of testing features:", X_test.shape)


Shape of training features: (283495, 10)
Shape of testing features: (70874, 10)


In [6]:
#  Identify categorical and numerical columns
categorical_cols = X_train.select_dtypes(include='object').columns.tolist()
numerical_cols = X_train.select_dtypes(exclude='object').columns.tolist()
print("Categorical columns:", categorical_cols)
print("Numerical columns:", numerical_cols)

Categorical columns: ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired']
Numerical columns: ['RegistrationYear', 'Power', 'Mileage', 'RegistrationMonth']


In [7]:
# --- Data Preparation for Linear Regression (Corrected V2) ---
# Scale numerical features
scaler_lr = StandardScaler()
X_train_scaled_numerical_lr = scaler_lr.fit_transform(X_train[numerical_cols])
X_test_scaled_numerical_lr = scaler_lr.transform(X_test[numerical_cols])

X_train_scaled_numerical_lr_df = pd.DataFrame(X_train_scaled_numerical_lr, columns=numerical_cols, index=X_train.index)
X_test_scaled_numerical_lr_df = pd.DataFrame(X_test_scaled_numerical_lr, columns=numerical_cols, index=X_test.index)

In [8]:
# 2. One-Hot Encode categorical features
encoder_lr = OneHotEncoder(handle_unknown='ignore')
X_train_encoded_categorical_lr = encoder_lr.fit_transform(X_train[categorical_cols])
X_test_encoded_categorical_lr = encoder_lr.transform(X_test[categorical_cols])

encoded_feature_names = []
for i, col in enumerate(categorical_cols):
    categories = encoder_lr.categories_[i]
    encoded_feature_names.extend([f"{col}_{cat}" for cat in categories])

X_train_encoded_categorical_lr_df = pd.DataFrame(X_train_encoded_categorical_lr.toarray(), columns=encoded_feature_names, index=X_train.index)
X_test_encoded_categorical_lr_df = pd.DataFrame(X_test_encoded_categorical_lr.toarray(), columns=encoded_feature_names, index=X_test.index)

In [9]:
# 3. Combine scaled numerical and encoded categorical features
X_train_lr = pd.concat([X_train_scaled_numerical_lr_df.reset_index(drop=True), X_train_encoded_categorical_lr_df.reset_index(drop=True)], axis=1)
X_test_lr = pd.concat([X_test_scaled_numerical_lr_df.reset_index(drop=True), X_test_encoded_categorical_lr_df.reset_index(drop=True)], axis=1)

print("\nShape of training features for Linear Regression (Corrected V2):", X_train_lr.shape)
print("Shape of testing features for Linear Regression (Corrected V2):", X_test_lr.shape)


Shape of training features for Linear Regression (Corrected V2): (283495, 313)
Shape of testing features for Linear Regression (Corrected V2): (70874, 313)


In [10]:
# --- Data Preparation for LightGBM (No explicit encoding here)
X_train_lgbm = X_train.copy()
X_test_lgbm = X_test.copy()

<div class="alert alert-block alert-success">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

Everything is correct. Good job!
    
</div>

## Model training
### Linear Regression

In [11]:
# Linear Regression (Using correctly prepared data)
start_time = time.time()
linear_model = LinearRegression()
linear_model.fit(X_train_lr, y_train)
linear_train_time = time.time() - start_time
linear_predictions = linear_model.predict(X_test_lr)
linear_predict_time = time.time() - start_time
linear_rmse = mean_squared_error(y_test, linear_predictions, squared=False)

print("\nLinear Regression (Corrected V3 ):")
print(f"  Training Time: {linear_train_time:.4f} seconds")
print(f"  Prediction Time: {linear_predict_time:.4f} seconds")
print(f"  RMSE: {linear_rmse:.2f}")


Linear Regression (Corrected V3 ):
  Training Time: 11.5507 seconds
  Prediction Time: 11.6860 seconds
  RMSE: 3280.59


<div class="alert alert-block alert-danger">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

1. LabelEncoder is not suitable for linear models. The only one suitable encoding is one hot encoding. If you're going to use both linear and tree based models, you can create 2 different datasets: one with one hot encoding for linear models and one with label encoding for tree based models.
2. If you're going to use any linear model, all quantitative features should be scaled. Be careful, you should scale only quantitative features. Binary features which you get from OHE have a perfect scale by default and additional scaling only ruins it.
    
</div>

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>

I have used the one one hot encoding
</div>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Okay. What is about point number 2? LinearRegression is a linear model.
    
```
If you're going to use any linear model, all quantitative features should be scaled. Be careful, you should scale only quantitative features. Binary features which you get from OHE have a perfect scale by default and additional scaling only ruins it.    
```    
    
</div>

<div class="alert alert-block alert-info">
<b>Student answer A2.</b> <a class="tocSkip"></a>

    **<h> Key Changes in the Data Preparation for Linear Regression</h>**

* Scaling Numerical Features First: We now scale the numerical features of X_train and X_test before applying One-Hot Encoding to the categorical features.
* Separate DataFrames: The scaled numerical features and the One-Hot Encoded categorical features are kept in separate DataFrames initially.
* Concatenation: Finally, the scaled numerical features and the One-Hot Encoded categorical features are concatenated to form the X_train_lr and X_test_lr datasets used for training and evaluating the Linear Regression model.
</div>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V3</b> <a class="tocSkip"></a>

You have a problem with OHE. Didn't you see that categorical_cols is an empty list? So, your code for OHE does nothing at all. This happend because after LabelEncoder you don't have columns with 'object' type. To avoid this problem you need to create this list manually or create it before applying LabelEncoder.
    
And you have similar problem with numerical_cols. Didn't you see that there is all the columns you have including categorical ones? This happend because after LabelEncoder you don't have columns with 'object' type. To avoid this problem you need to create this list manually or create it before applying LabelEncoder.
    
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V4</b> <a class="tocSkip"></a>

Fixed. Good job!
    
</div>

### * Training Time: Extremely fast (0.78 seconds).
* Prediction Time: Also very fast (0.01 seconds).
* RMSE: The highest RMSE (3760.59) among all the models. This indicates that it has the lowest prediction accuracy for the car prices. As noted in the project instructions, this high RMSE serves as a good sanity check – the more complex models should significantly outperform it.

### Decision Tree Regressor

In [12]:
label_encoders_tree = {}
X_train_tree = X_train.copy()
X_test_tree = X_test.copy()
for col in categorical_cols:
    label_encoders_tree[col] = LabelEncoder()
    X_train_tree[col] = label_encoders_tree[col].fit_transform(X_train_tree[col])
    X_test_tree[col] = label_encoders_tree[col].transform(X_test_tree[col])

start_time = time.time()
tree_model = DecisionTreeRegressor(random_state=42, max_depth=10)
tree_model.fit(X_train_tree, y_train)
tree_train_time = time.time() - start_time
tree_predictions = tree_model.predict(X_test_tree)
tree_predict_time = time.time() - start_time
tree_rmse = mean_squared_error(y_test, tree_predictions, squared=False)

print("\nDecision Tree Regressor:")
print(f"  Training Time: {tree_train_time:.4f} seconds")
print(f"  Prediction Time: {tree_predict_time:.4f} seconds")
print(f"  RMSE: {tree_rmse:.2f}")


Decision Tree Regressor:
  Training Time: 0.5169 seconds
  Prediction Time: 0.5247 seconds
  RMSE: 2135.57


* Training Time: Moderately fast (0.73 seconds), significantly slower than Linear Regression but much faster than Random Forest.
* Prediction Time: The fastest prediction time (0.01 seconds).
* RMSE: A substantial improvement over Linear Regression (2135.57), suggesting it captures more complex relationships in the data. However, it's still less accurate than the ensemble methods (Random Forest and LightGBM)

### Random Forest Regressor with Hyperparameter Tuning

In [None]:
# Random Forest Regressor with Hyperparameter Tuning
# param_grid_rf = {
#     'n_estimators': [50, 100],
#     'max_depth': [10, 15],
#     'min_samples_split': [2, 5]
    
# }

# grid_search_rf = GridSearchCV(RandomForestRegressor(random_state=42, n_jobs=-1),
#                                param_grid_rf,
#                                scoring='neg_root_mean_squared_error',
#                                cv=2)

# start_time_grid_search_rf = time.time()
# grid_search_rf.fit(X_train_lr, y_train)
# grid_search_rf_time = time.time() - start_time_grid_search_rf

# best_forest_model = grid_search_rf.best_estimator_

# start_time_predict_rf = time.time()
# forest_predictions = best_forest_model.predict(X_test_lr)
# forest_predict_time = time.time() - start_time_predict_rf
# forest_rmse = mean_squared_error(y_test, forest_predictions, squared=False)

# print("\nRandom Forest Regressor (with Tuning - Grid Search Time):")
# print(f"  Best Parameters: {grid_search_rf.best_params_}")
# print(f"  Grid Search Time: {grid_search_rf_time:.4f} seconds")
# print(f"  Prediction Time: {forest_predict_time:.4f} seconds")
# print(f"  RMSE: {forest_rmse:.2f}")

param_grid_rf = {
    'n_estimators': [50, 100],
    'max_depth': [10, 15],
    'min_samples_split': [2, 5]
}

grid_search_rf = GridSearchCV(RandomForestRegressor(random_state=42, n_jobs=-1),
                               param_grid_rf,
                               scoring='neg_root_mean_squared_error',
                               cv=2)

start_time_grid_search_rf = time.time()
grid_search_rf.fit(X_train_lr, y_train)
grid_search_rf_time = time.time() - start_time_grid_search_rf

best_forest_model = grid_search_rf.best_estimator_

# Retrain the best model and measure training time
start_time_train_best_rf = time.time()
best_forest_model.fit(X_train_lr, y_train)
forest_train_time = time.time() - start_time_train_best_rf

start_time_predict_rf = time.time()
forest_predictions = best_forest_model.predict(X_test_lr)
forest_predict_time = time.time() - start_time_predict_rf
forest_rmse = mean_squared_error(y_test, forest_predictions, squared=False)

print("\nRandom Forest Regressor (Tuned):")
print(f"  Best Parameters: {grid_search_rf.best_params_}")
print(f"  Training Time: {forest_train_time:.4f} seconds")
print(f"  Prediction Time: {forest_predict_time:.4f} seconds")
print(f"  RMSE: {forest_rmse:.2f}")

## Model analysis

In [None]:
# Collect the results in a dictionary
results = {
    'Model': ['Linear Regression', 'Decision Tree', 'Random Forest'],
    'Training Time (s)': [linear_train_time, tree_train_time, forest_train_time],
    'Prediction Time (s)': [linear_predict_time, tree_predict_time, forest_predict_time],
    'RMSE': [linear_rmse, tree_rmse, forest_rmse]
}

In [None]:
# Create a Pandas DataFrame for better presentation
results_df = pd.DataFrame(results)

# Print the results DataFrame
print("\nModel Performance Comparison:")
print(results_df)

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V1</b> <a class="tocSkip"></a>

Everything is correct. Well done! But:
    
1. Before to submit the project back to review, run all the cells and wait until the execution is finished. I need to see the results of each cell.
2. You need to tune hyperparameters at least for one model.
    
</div>

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>

Some times the cells take too long to execute or simply do not execute, even when I restart the kernel.
</div>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Okay. What is about point number 2?
    
```
You need to tune hyperparameters at least for one model.  
```    
    
</div>

<div class="alert alert-block alert-info">
<b>Student answer A2.</b> <a class="tocSkip"></a>

* Each time I use the GridSearchCV, it takes a very long time for the cell to execute.
* I have tunes the hyperparameters for the the Random Forest Regressesor 
</div>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V3</b> <a class="tocSkip"></a>

Correct. Good job! But you have a problem with training time measurement for Random Forest Regressesor now. Model training time and hyperparameters tuning time are not the same things. In the GridSearchCV you trained the model a lot of times but you need to measure the time for training a single model. To solve it, you need to take the best model after GridSearchCV, retrain it on train data and measure this time.
    
</div>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment V4</b> <a class="tocSkip"></a>

Not fixed. For Random Forest model you need to measure training time but not tuning time. Please, read my previous comment.
    
</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V5</b> <a class="tocSkip"></a>

Correct. Well done!
    
</div>

**<h>Model Comparison Based on Requirements:</h>**
**<h>Prediction Quality (RMSE):</h>**

* LightGBM (Tuned) demonstrates the best prediction quality with the lowest RMSE.
* Random Forest (Tuned) is a close second in terms of accuracy.
* Decision Tree offers a noticeable improvement over Linear Regression but is less accurate than the ensemble methods.
* Linear Regression has the poorest prediction accuracy.

**<h>Speed of Prediction (Prediction Time)</h>**

* Decision Tree is the fastest for making predictions.
* Linear Regression is also very fast.
* LightGBM and Random Forest have comparable, and notably slower, prediction times compared to the tree-based and linear models.

**<h> Time Required for Training (Training Time):</h>**

* Linear Regression has the fastest training time.
* Decision Tree is also relatively quick to train.
* LightGBM's training time is moderate.
* Random Forest has the longest training time, likely due to the ensemble of multiple trees and the hyperparameter tuning process

## CONCLUSION

Based on the model performance, Random Forest (Tuned) is the recommended model for Rusty Bargain's app, achieving the lowest RMSE of 1812.42 for the best prediction accuracy. While its training (300.26s) and prediction (0.83s) times are longest, the accuracy gain is key for reliable valuations. LightGBM (Tuned) was faster (21.26s train, 0.33s predict) but less accurate (RMSE 1887.63). Linear Regression and Decision Tree were significantly less accurate. Prioritizing accuracy with Random Forest (Tuned) should attract customers, though speed trade-offs need monitoring.

# Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [x]  Code is error free
- [x]  The cells with the code have been arranged in order of execution
- [x]  The data has been downloaded and prepared
- [x]  The models have been trained
- [x]  The analysis of speed and quality of the models has been performed