<div style="border: 5px solid purple; padding: 10px; margin: 5px">
<b>   Svetlana's comment  </b>
      
Hi Taylor, my name is Svetlana (my handle on Discord is `svetatripleten`). Congratulations on submitting another project! 🎉 I will be using the standard color marking:
    

   
    
<div style="border: 5px solid green; padding: 10px; margin: 5px">

Great solutions and ideas that can and should be used in the future are in green comments. Some of them are: 
    
    
- You have successfully prepared the subsets. It is important to split the data correctly in order to ensure there's no intersection;    
    

- Handled outliers; 
    

- Excluded irrelevant columns to reduce computational cost;
    
    
- Encoded categorical columns;

    
 
- Trained and compared several models, great!

    
- Measured their training and prediction speed.
   

- Analyzed metrics. It is not enough to just fit the model and print the result. Instead, we have to analyze the results as it helps us identify what can be improved;
  
    
- Wrote an excellent conclusion! A well-written conclusion shows how the project met its objectives and provides a concise and understandable summary for those who may not have been involved in the details of the project. Good job! 

</div>
    
<div style="border: 5px solid gold; padding: 10px; margin: 5px">
<b> Reviewer's comment </b>

Yellow color indicates what should be optimized. This is not necessary, but it will be great if you make changes to this project. I've left several recommendations throughout the project. Please take a look.
 
</div>
<div style="border: 5px solid red; padding: 10px; margin: 5px">
<b> Reviewer's comment </b>

Issues that must be corrected to achieve accurate results are indicated in red comments. Please note that the project cannot be accepted until these issues are resolved. For instance,



- Check the data for duplicates after you drop columns. If a dropped column contained unique values (ID or timestamp), removing it may make multiple rows appear the same;


- We have too many gaps to drop them. Instead, consider replacing them with some unique value, such as "Unknown". It is normal that sometimes sellers do not specify some information. The model should "know" about such cases. 



    
- According to the task, we need to train LGBM as well;





There may be other issues that need your attention. I described everything in my comments.  
</div>        
<hr>
    
<font color='dodgerblue'>**To sum up:**</font> great job here! You demonstrated strong analytical and modeling skills by preparing the data, experimenting with multiple advanced models, and evaluating them with appropriate metrics. The conclusion clearly communicates which model offers the best trade-off between speed and RMSE. There are several issues that need your attention. Please take a look at my comments and do not hesitate to ask questions if some of them seem unclear. I will be waiting for the project for the second review 😊 
    

<hr>
    
Please use some color other than those listed to highlight answers to my comments.
I would also ask you **not to change, move or delete my comments** to make it easier for me to navigate during the next review.
    
<hr> 
    
✍️ Some notes:


- Here's a link to [Supervised Learning documentation sections](https://scikit-learn.org/stable/supervised_learning.html) that you may find useful.



- There are advanced tools such as [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) and [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). `ColumnTransformer` and `Pipeline` are essential tools that help us create robust, maintainable, and efficient machine learning workflows. They work with data much more effectively. You can handle different data types and it is much easier to avoid data leakage. The code organization is very clean, but it may seem a bit difficult at the beginning. Take a look at this page to learn how to [organize a pipeline with ColumnTransformer](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html).  
<hr>
    
    
📌 Please feel free to schedule a 1:1 session with our tutors or TAs [here](https://calendly.com/tripleten-ds-experts-team), join daily coworking sessions, or ask questions in the sprint channels on Discord if you need assistance 😉 
</div>

<div style="border: 5px solid purple; padding: 10px; margin: 5px">
<b>   Matias's comment  </b>
      
Thanks for following Svetlana's red comments and congrats on your approval, Taylor!

</div>

Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training

In [1]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning models
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import lightgbm as lgb

# Model evaluation
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Timing utility
import time

# To ignore warnings during model training
import warnings
warnings.filterwarnings('ignore')

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder




## Data preparation

In [9]:


# Load data
data = pd.read_csv('/datasets/car_data.csv')

# Basic cleanup (adjust depending on your previous preprocessing steps)
data = data.drop(['DateCrawled', 'DateCreated', 'LastSeen', 'NumberOfPictures', 'PostalCode'], axis=1)

# Fill missing categorical values
for col in ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'NotRepaired']:
    data[col] = data[col].fillna('Unknown')

# Merge similar values
data['FuelType'] = data['FuelType'].replace({'gasoline': 'petrol'})

# Fill missing prices with grouped median
data['Price'] = data.groupby(['Model', 'RegistrationYear'])['Price'].transform(
    lambda x: x.fillna(x.median())
)

# Filter outliers
data = data[
    (data['Price'] >= 100) & (data['Price'] <= 100000) &
    (data['RegistrationYear'] >= 1980) & (data['RegistrationYear'] <= 2022) &
    (data['Power'] >= 10) & (data['Power'] <= 500)
]

# Drop columns that aren't needed
data = data.drop(['RegistrationMonth'], axis=1)

# Drop duplicates
data = data.drop_duplicates().reset_index(drop=True)



## Data Preparation (Revised Based on Reviewer Feedback)

In this step, we performed improved data cleaning and preparation to make the dataset suitable for machine learning, while minimizing data loss and enhancing model robustness:

1. **Loaded the dataset** from `/datasets/car_data.csv`.

2. **Dropped irrelevant columns**:
   - Removed columns like `DateCrawled`, `DateCreated`, `LastSeen`, `NumberOfPictures`, and `PostalCode`, which don't contribute to predicting price and can introduce noise.

3. **Handled missing values**:
   - For categorical columns (`VehicleType`, `Gearbox`, `Model`, `FuelType`, `NotRepaired`), missing values were replaced with `"Unknown"` instead of being dropped.
   - This ensures that potentially useful data isn't discarded due to incomplete entries.

4. **Unified inconsistent categories**:
   - Merged similar fuel types (e.g., `"gasoline"` replaced with `"petrol"`) to reduce redundancy in encoding.

5. **Filled in missing price values**:
   - Where `Price` was missing, it was imputed using the **median price** grouped by `Model` and `RegistrationYear`, helping to retain valuable but incomplete rows.

6. **Removed outliers and unrealistic values**:
   - Kept only cars priced between **€100 and €100,000**.
   - Kept only cars registered between **1980 and 2022**.
   - Filtered horsepower values to between **10 hp and 500 hp**.

7. **Dropped RegistrationMonth**:
   - Removed this column to simplify the model and avoid overfitting on a variable with weak signal strength.

8. **Removed duplicates**:
   - After column cleanup, duplicates were checked and removed to prevent data leakage or training bias.

9. **Reset the index** to ensure a clean DataFrame structure.

This cleaning process prepares a realistic and practical dataset, preserving useful patterns without introducing model bias through over-filtering.



<div style="border: 5px solid green; padding: 10px; margin: 5px">
<b>   Reviewer's comment </b>
    
Agreed! We don't need some of the columns. 
    
</div>
<div style="border: 5px solid gold; padding: 10px; margin: 5px">
<b>   Reviewer's comment </b>

- You can save some gaps and replace them with the median price when grouping by model and registration year. Yes, registration year does not define the vehicle's age, but still. 

    
- You can then drop `RegistrationMonth`. It will significantly simplify the training process. 


- Another option is to drop `VehicleType` and `Brand`, since we have `Model` that should reflect both. 


- It will be perfect if you analyze the distributions and display the charts, thus showing a reader why you decided to delete specific rows. Why is this important? In real-world problems, the data is rarely clean. Displaying distributions helps us evaluate the data, find outliers, identify the required preprocessing steps, and understand feature relationships, which informs feature engineering. In some cases, feature engineering is the key.



- Consider analyzing categories as well. Petrol and gasoline refer to the same fuel, so we can use one of these categories. There are also some rare model categories that can be dropped. If a category appears only in the training or validation subset, for instance, and we use `handle_unknown='ignore'`, the linear model might miss important signals in validation or make predictions with incomplete features thus breaking the assumptions of linearity. It may be helpful to make sure that training and validation subsets use the same feature columns after encoding. 
</div>
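
One possible way to find and drop rare `Model` categories, as suggested in the comment above, is sketched below. The toy data and the `min_count` threshold are assumptions made for illustration; they are not taken from the project.

```python
import pandas as pd

# Toy data (invented): two frequent models and two models that appear once
cars = pd.DataFrame(
    {"Model": ["golf"] * 5 + ["polo"] * 4 + ["rare_a", "rare_b"]}
)

min_count = 3  # hypothetical threshold, tune it to the real distribution

# Count how often each model occurs and collect the rare ones
model_counts = cars["Model"].value_counts()
rare_models = model_counts[model_counts < min_count].index

# Keep only rows whose model is frequent enough
cars_common = cars[~cars["Model"].isin(rare_models)].reset_index(drop=True)

print(sorted(rare_models))   # ['rare_a', 'rare_b']
print(len(cars_common))      # 9
```

Replacing rare values with a shared label such as `"other"` instead of dropping the rows would be an alternative that keeps more data.
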
<div style="border: 5px solid red; padding: 15px; margin: 5px">
<b> Reviewer's comment</b>
    
- Let's not drop so many rows :) Instead, replace missing values with some unique value, such as "Unknown". Moreover, it is normal that sometimes sellers do not specify some information. The model should "know" about such cases.


- After removing unnecessary columns, it makes sense to check the data for duplicates again, as the dataset will later be split into training and test sets. Removing specific columns can cause previously distinct rows to become identical. If a dropped column contained unique values (ID or timestamp), removing it may make multiple rows appear the same.   
</div>
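
A small sketch of why the duplicate check belongs after the column drop. The three toy rows are invented for illustration; only the column names `DateCrawled`, `Brand`, and `Price` echo the project data.

```python
import pandas as pd

# Toy listings (invented): the first two rows differ only in the crawl timestamp
listings = pd.DataFrame({
    "DateCrawled": ["2016-03-24 11:52", "2016-03-25 09:10", "2016-03-26 17:30"],
    "Brand": ["volkswagen", "volkswagen", "audi"],
    "Price": [1500, 1500, 9800],
})

# With the timestamp present, every row is unique
dupes_before = int(listings.duplicated().sum())

# After dropping the timestamp, the first two rows become identical
trimmed = listings.drop(columns=["DateCrawled"])
dupes_after = int(trimmed.duplicated().sum())

deduplicated = trimmed.drop_duplicates().reset_index(drop=True)

print(dupes_before, dupes_after, len(deduplicated))  # 0 1 2
```
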

<div style="border: 5px solid green; padding: 10px; margin: 5px">
<b>   Reviewer's comment Iter 2</b>
    
Great job ;)

</div>

## Model training

In [10]:
import lightgbm
print("LightGBM is installed and ready to use.")


LightGBM is installed and ready to use.


In [11]:
from lightgbm import LGBMRegressor

# Split features and target
X = data.drop('Price', axis=1)
y = data['Price']

# Identify categorical and numerical features
cat_features = X.select_dtypes(include='object').columns.tolist()
num_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Step 1: Split into Train+Validation and Test sets
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Step 2: Split Train+Validation into Train and Validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)
# Now: 60% train, 20% validation, 20% test

# ---------- Pipelines ---------- #

# Linear Regression with OneHotEncoder
ohe = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(handle_unknown='ignore', sparse=False), cat_features)],
    remainder='passthrough'
)

lr_model = Pipeline([
    ('encoder', ohe),
    ('model', LinearRegression())
])

start = time.time()
lr_model.fit(X_train, y_train)
lr_train_time = time.time() - start

start = time.time()
lr_preds = lr_model.predict(X_valid)
lr_predict_time = time.time() - start
lr_rmse = mean_squared_error(y_valid, lr_preds, squared=False)

# Decision Tree and Random Forest with OrdinalEncoder
oe = ColumnTransformer(
    transformers=[('cat', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), cat_features)],
    remainder='passthrough'
)

# Decision Tree
dt_model = Pipeline([
    ('encoder', oe),
    ('model', DecisionTreeRegressor(max_depth=10, random_state=42))
])

start = time.time()
dt_model.fit(X_train, y_train)
dt_train_time = time.time() - start

start = time.time()
dt_preds = dt_model.predict(X_valid)
dt_predict_time = time.time() - start
dt_rmse = mean_squared_error(y_valid, dt_preds, squared=False)

# Random Forest
rf_model = Pipeline([
    ('encoder', oe),
    ('model', RandomForestRegressor(n_estimators=100, max_depth=15, random_state=42))
])

start = time.time()
rf_model.fit(X_train, y_train)
rf_train_time = time.time() - start

start = time.time()
rf_preds = rf_model.predict(X_valid)
rf_predict_time = time.time() - start
rf_rmse = mean_squared_error(y_valid, rf_preds, squared=False)

# LightGBM - no encoder needed if cat features are categorical dtype
X_train_lgbm = X_train.copy()
X_valid_lgbm = X_valid.copy()
for col in cat_features:
    X_train_lgbm[col] = X_train_lgbm[col].astype('category')
    X_valid_lgbm[col] = X_valid_lgbm[col].astype('category')

lgbm_model = LGBMRegressor(n_estimators=100, max_depth=10, random_state=42)

start = time.time()
lgbm_model.fit(X_train_lgbm, y_train)
lgbm_train_time = time.time() - start

start = time.time()
lgbm_preds = lgbm_model.predict(X_valid_lgbm)
lgbm_predict_time = time.time() - start
lgbm_rmse = mean_squared_error(y_valid, lgbm_preds, squared=False)

# ---------- Results Table ---------- #
results = pd.DataFrame({
    'Model': ['Linear Regression', 'Decision Tree', 'Random Forest', 'LightGBM'],
    'Train Time (s)': [lr_train_time, dt_train_time, rf_train_time, lgbm_train_time],
    'Predict Time (s)': [lr_predict_time, dt_predict_time, rf_predict_time, lgbm_predict_time],
    'Validation RMSE': [lr_rmse, dt_rmse, rf_rmse, lgbm_rmse]
})

print(results)


               Model  Train Time (s)  Predict Time (s)  Validation RMSE
0  Linear Regression        6.252804          0.249876      2475.580187
1      Decision Tree        0.514149          0.055528      1976.326538
2      Random Forest       21.837047          0.788469      1661.316088
3           LightGBM        1.942370          0.212684      1612.084454


In [12]:
from sklearn.model_selection import GridSearchCV

# Convert categorical columns to 'category' dtype for LightGBM
X_train_lgbm = X_train.copy()
X_valid_lgbm = X_valid.copy()
for col in cat_features:
    X_train_lgbm[col] = X_train_lgbm[col].astype('category')
    X_valid_lgbm[col] = X_valid_lgbm[col].astype('category')

# Define parameter grid
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [10, 20, None],
    'learning_rate': [0.05, 0.1],
    'num_leaves': [31, 50],
}

# Create the model
lgbm = LGBMRegressor(random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=lgbm,
    param_grid=param_grid,
    scoring='neg_root_mean_squared_error',
    cv=3,
    verbose=1,
    n_jobs=-1
)

# Fit GridSearchCV
grid_search.fit(X_train_lgbm, y_train)

# Best model
best_lgbm = grid_search.best_estimator_

# Predict and evaluate on validation set
start = time.time()
lgbm_preds = best_lgbm.predict(X_valid_lgbm)
lgbm_predict_time = time.time() - start
lgbm_rmse = mean_squared_error(y_valid, lgbm_preds, squared=False)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Validation RMSE (tuned LGBM): {lgbm_rmse:.2f}")
print(f"Prediction Time (s): {lgbm_predict_time:.4f}")


Fitting 3 folds for each of 24 candidates, totalling 72 fits
Best Parameters: {'learning_rate': 0.1, 'max_depth': 20, 'n_estimators': 300, 'num_leaves': 50}
Validation RMSE (tuned LGBM): 1553.90
Prediction Time (s): 0.5935


In [13]:
# Append tuned LightGBM results to the table
tuned_results = pd.DataFrame({
    'Model': ['LightGBM (Tuned)'],
    'Train Time (s)': ['(see GridSearch log)'],  # You can time it manually if needed
    'Predict Time (s)': [lgbm_predict_time],
    'Validation RMSE': [lgbm_rmse]
})

results = pd.concat([results, tuned_results], ignore_index=True)
print(results)


               Model        Train Time (s)  Predict Time (s)  Validation RMSE
0  Linear Regression              6.252804          0.249876      2475.580187
1      Decision Tree              0.514149          0.055528      1976.326538
2      Random Forest             21.837047          0.788469      1661.316088
3           LightGBM               1.94237          0.212684      1612.084454
4   LightGBM (Tuned)  (see GridSearch log)          0.593513      1553.900028


<div style="border: 5px solid green; padding: 15px; margin: 5px">
<b>   Reviewer's comment </b>
    
Yes, we need to encode data here, well done! It is acceptable to use `get_dummies` in this project, and we have to use it before we split the data because if we use it after we divide the data, we may face a situation where the subsets have different numbers of categories.
    
</div>
<div style="border: 5px solid gold; padding: 15px; margin: 5px">
<b>   Reviewer's comment </b>


- Consider saving at least one subset for the final testing. The best way to evaluate the model is to train it on the training data, calculate its metric on validation data, and, at the very end of the project, test the best model (it's usually one model) on the hold-out subset, the test subset. In this case, we need 3 subsets. However, if you use GridSearch, it is enough to have two subsets, since GridSearch implements cross-validation.


- If the columns we want to convert are not explicitly specified, `get_dummies` will convert all columns with categorical strings, which may lead to unexpected results if some numeric columns also contain categorical data represented in numerical form (if there's a numerical category displayed as [1, 2, 3, 2, ... ]).


    
- Please note that `OneHotEncoder(handle_unknown='ignore')` or `OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)` are generally more robust than `get_dummies` because they can handle situations where test subset has features that were not available during training. [Difference between OneHotEncoder and get_dummies](https://pythonsimplified.com/difference-between-onehotencoder-and-get_dummies/). 
    
    
    
For tree-based models, `OrdinalEncoder` is a better choice because of computational cost. For boosting algorithms, we can rely on internal encoders that usually perform even better than external ones. For `CatBoost`, this is controlled by the `cat_features` parameter. For `LightGBM`, you can convert categorical features to the category type, allowing the model to handle them automatically.
    
   

    
`OrdinalEncoder()` or `LabelEncoder()` should not be used with linear models if there's no ordinal relationship. [How and When to Use Ordinal Encoder](https://leochoi146.medium.com/how-and-when-to-use-ordinal-encoder-d8b0ef90c28c). For linear regression, I recommend using `OneHotEncoder(handle_unknown='ignore')`. 


If you decide to use any of these methods, please encode data **after** you split it. 

    
    
    
For instance, you can use `Ordinal` for Forest and Tree, `OneHotEncoder` for Lin. Regression and categorical data types for boosting models.
</div>
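
The behaviour of the two encoders on a category that was not seen during training can be illustrated with a minimal sketch. The toy subsets are invented, and the encoders are fitted on the training subset only, as the comment above recommends.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy subsets (invented): "electric" never appears in the training subset
train = pd.DataFrame({"FuelType": ["petrol", "diesel", "petrol"]})
valid = pd.DataFrame({"FuelType": ["diesel", "electric"]})

# Fit on the training subset only, then reuse the fitted encoder on validation
ohe = OneHotEncoder(handle_unknown="ignore").fit(train)
ohe_valid = ohe.transform(valid).toarray()  # unseen category -> all-zero row

oe = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1).fit(train)
oe_valid = oe.transform(valid)              # unseen category -> -1

print(ohe_valid)  # rows: diesel -> [1, 0], electric -> [0, 0]
print(oe_valid)   # rows: diesel -> 0, electric -> -1
```

In both cases the validation subset keeps the same number of feature columns as the training subset, which is the property that `get_dummies` applied to separate subsets does not guarantee.
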



##  Model Training (Updated)

In this section, we trained and compared **five machine learning models** to predict the market value of used cars.

###  Models Trained:
1. **Linear Regression** – A fast and simple model that assumes linear relationships. Great for baseline performance.
2. **Decision Tree Regressor** – Learns rules from the data and splits into decision paths. Good for handling non-linear data.
3. **Random Forest Regressor** – An ensemble of many decision trees. More accurate but slower to train.
4. **LightGBM Regressor** – A fast and efficient gradient boosting framework that handles categorical features natively.
5. **LightGBM (Tuned)** – The LightGBM model optimized using **GridSearchCV** to improve accuracy by tuning its hyperparameters.

###  Data Preparation:
- We split the dataset into:
  - **60% Training**
  - **20% Validation**
  - **20% Test (hold-out for final evaluation)**
- For **Linear Regression**, we used `OneHotEncoder` to convert categorical features.
- For **tree-based models**, we used `OrdinalEncoder`, which is faster and sufficient for decision-based algorithms.
- For **LightGBM**, we converted categorical columns to `category` dtype to use LightGBM's internal encoding.
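
The 60/20/20 proportions follow from applying `test_size=0.20` first and `test_size=0.25` to the remainder. A short arithmetic check, using only the two `test_size` values from the splitting code:

```python
from fractions import Fraction

# First split: 20% of all rows go to the test subset
test_share = Fraction(20, 100)
remaining = 1 - test_share

# Second split: 25% of the remaining rows go to the validation subset
valid_share = remaining * Fraction(25, 100)
train_share = remaining - valid_share

print(train_share, valid_share, test_share)  # 3/5 1/5 1/5
```
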

###  Evaluation Metric:
We used **RMSE (Root Mean Squared Error)** to measure prediction accuracy.  
 **Lower RMSE means better predictions**.
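
As a reminder of what the metric computes, here is a minimal pure-Python sketch of RMSE. The prices below are made up for illustration; in the notebook the value comes from `mean_squared_error(..., squared=False)`.

```python
import math


def rmse(y_true, y_pred):
    """Root of the mean squared difference between targets and predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up prices: the errors are 100, -100 and 200
actual = [1000, 2000, 3000]
predicted = [1100, 1900, 3200]

print(round(rmse(actual, predicted), 2))  # 141.42
```

Because the errors are squared before averaging, a single large miss raises RMSE more than several small ones, and the result is expressed in the same units as the price.
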

---

###  Results (on Validation Set):

| Model                | Train Time (s) | Predict Time (s) | Validation RMSE |
|----------------------|----------------|------------------|-----------------|
| Linear Regression    | 5.90           | 0.16             | 2475.58         |
| Decision Tree        | 0.48           | 0.05             | 1976.33         |
| Random Forest        | 20.38          | 0.75             | 1661.32         |
| LightGBM             | 1.41           | 0.30             | 1612.08         |
| **LightGBM (Tuned)** | _(GridSearch)_ | 0.53             | **1553.90**     |

---

### Summary:
- **Tuned LightGBM** gave the best performance with the **lowest RMSE**.
- **Random Forest** also performed well, though slower to train and predict.
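- **Decision Tree** was the fastest model to train and predict, with moderate accuracy.
- **Linear Regression** was the least accurate of the trained models and slower to train than the Decision Tree; it is useful mainly as a simple baseline.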



In [14]:
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
import time

# -----------------------
# Dummy Regressor (Baseline)
# -----------------------
dummy_model = DummyRegressor(strategy='mean')

start = time.time()
dummy_model.fit(X_train, y_train)
dummy_train_time = time.time() - start

start = time.time()
dummy_preds = dummy_model.predict(X_valid)
dummy_predict_time = time.time() - start
dummy_rmse = mean_squared_error(y_valid, dummy_preds, squared=False)

# Add to results DataFrame
results.loc[len(results)] = [
    'Dummy Regressor',
    dummy_train_time,
    dummy_predict_time,
    dummy_rmse
]

# Display updated results
print(results)


               Model        Train Time (s)  Predict Time (s)  Validation RMSE
0  Linear Regression              6.252804          0.249876      2475.580187
1      Decision Tree              0.514149          0.055528      1976.326538
2      Random Forest             21.837047          0.788469      1661.316088
3           LightGBM               1.94237          0.212684      1612.084454
4   LightGBM (Tuned)  (see GridSearch log)          0.593513      1553.900028
5    Dummy Regressor              0.000662          0.000604      4588.074764


## Baseline Model (Dummy Regressor)

To evaluate the effectiveness of our models, we introduced a **Dummy Regressor** that simply predicts the **mean price** from the training data for all cases.

| Model              | Validation RMSE |
|--------------------|------------------|
| Dummy Regressor    | 4588.07          |
| Linear Regression  | 2475.58          |
| Decision Tree      | 1976.33          |
| Random Forest      | 1661.32          |
| LightGBM           | 1612.08          |
| LightGBM (Tuned)   | **1553.90** ✅    |

### Why It Matters:
- The Dummy Regressor sets a **baseline** — any real model should do better than this.
- All trained models outperform the Dummy model, with **LightGBM (Tuned)** achieving the lowest error.
- This validates that our models are **actually learning patterns** from the data instead of guessing blindly.
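
To make the baseline concrete, the sketch below shows that `DummyRegressor(strategy='mean')` ignores the features and always returns the mean of the training target. The numbers are invented for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Invented toy data: the features are irrelevant to a mean-based baseline
X_train = np.zeros((4, 1))
y_train = np.array([1000, 2000, 3000, 6000])  # mean is 3000
X_valid = np.zeros((2, 1))
y_valid = np.array([2000, 4000])

baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
preds = baseline.predict(X_valid)

# RMSE of the constant prediction on the validation targets
baseline_rmse = float(np.sqrt(np.mean((y_valid - preds) ** 2)))

print(preds)          # [3000. 3000.]
print(baseline_rmse)  # 1000.0
```
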



In [None]:
from sklearn.model_selection import GridSearchCV
from lightgbm import LGBMRegressor

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [10, 20],
    'learning_rate': [0.05, 0.1],
    'num_leaves': [31, 50]
}

# Prepare training data with categorical dtype
X_train_lgbm = X_train.copy()
X_valid_lgbm = X_valid.copy()
for col in cat_features:
    X_train_lgbm[col] = X_train_lgbm[col].astype('category')
    X_valid_lgbm[col] = X_valid_lgbm[col].astype('category')

# Initialize the base model
lgbm_base = LGBMRegressor(random_state=42)

# Run GridSearchCV
lgbm_grid = GridSearchCV(
    lgbm_base,
    param_grid,
    cv=3,
    scoring='neg_root_mean_squared_error',
    verbose=1,
    n_jobs=-1
)

# Fit the grid search model
lgbm_grid.fit(X_train_lgbm, y_train)


Fitting 3 folds for each of 16 candidates, totalling 48 fits


In [None]:
# Use the best estimator from GridSearchCV
best_lgbm_model = lgbm_grid.best_estimator_

# Prepare test data
X_test_lgbm = X_test.copy()
for col in cat_features:
    X_test_lgbm[col] = X_test_lgbm[col].astype('category')

# Predict and evaluate
start = time.time()
test_preds = best_lgbm_model.predict(X_test_lgbm)
test_predict_time = time.time() - start
test_rmse = mean_squared_error(y_test, test_preds, squared=False)

print(f"✅ Final Test RMSE (LightGBM Tuned): {test_rmse:.2f}")
print(f"🕒 Test Prediction Time: {test_predict_time:.3f} seconds")



## Model Comparison Summary

We trained several machine learning models to predict used car prices and compared their performance. Along with the standard models, we added a baseline (Dummy Regressor) and tuned LightGBM for better results.

| Model               | Train Time (s)       | Predict Time (s) | Validation RMSE |
|--------------------|----------------------|------------------|-----------------|
| Linear Regression  | 5.897                | 0.164            | 2475.58         |
| Decision Tree      | 0.484                | 0.050            | 1976.33         |
| Random Forest      | 20.380               | 0.750            | 1661.32         |
| LightGBM           | 1.411                | 0.296            | 1612.08         |
| LightGBM (Tuned)   | (see GridSearch log) | 0.529            | **1553.90**     |
| Dummy Regressor    | 0.001                | 0.000            | 4588.07         |

### Final Test Results
- **Best performer**: Tuned LightGBM  
- **Test RMSE**: 1605.29  
- **Prediction time on test set**: 0.27 seconds

### Key Takeaways
- The **Dummy Regressor** gives us a baseline to beat—it just predicts the average.
- **Tuned LightGBM** delivered the best validation RMSE; it was the only model evaluated on the test set, where it reached an RMSE of 1605.29.
- While **Random Forest** also did well, it took much longer to train.
- In real-time systems, **prediction time matters too**, especially if the model needs to be used frequently or at scale.

Up next: we'll update the performance plots to include all six models.



<div style="border: 5px solid gold; padding: 15px; margin: 5px">
<b>   Reviewer's comment </b>
    
Consider tuning hyperparameters to improve your models.

</div>
<div style="border: 5px solid red; padding: 15px; margin: 5px">
<b>   Reviewer's comment </b>

According to the task, we also need to train LightGBM model. Would you try? 

</div>

<div style="border: 5px solid green; padding: 10px; margin: 5px">
<b>   Reviewer's comment Iter 2</b>
    
Great job ;)

</div>

## Model analysis

In [None]:
# Make sure all numeric columns are actually numeric
results['Train Time (s)'] = pd.to_numeric(results['Train Time (s)'], errors='coerce')
results['Predict Time (s)'] = pd.to_numeric(results['Predict Time (s)'], errors='coerce')
results['Validation RMSE'] = pd.to_numeric(results['Validation RMSE'], errors='coerce')

# Optionally drop rows with NaNs (from non-numeric values)
results_clean = results.dropna(subset=['Train Time (s)', 'Predict Time (s)', 'Validation RMSE'])


In [None]:
# Plot RMSE values
plt.figure(figsize=(10, 5))
plt.bar(results_clean['Model'], results_clean['Validation RMSE'], color='skyblue')
plt.title('Model Comparison by RMSE')
plt.ylabel('RMSE (Lower is Better)')
plt.xticks(rotation=15)
plt.grid(axis='y')
plt.tight_layout()
plt.show()

# Training time
plt.figure(figsize=(10, 5))
plt.bar(results_clean['Model'], results_clean['Train Time (s)'], color='orange')
plt.title('Model Comparison by Training Time')
plt.ylabel('Time (seconds)')
plt.xticks(rotation=15)
plt.grid(axis='y')
plt.tight_layout()
plt.show()

# Prediction time
plt.figure(figsize=(10, 5))
plt.bar(results_clean['Model'], results_clean['Predict Time (s)'], color='green')
plt.title('Model Comparison by Prediction Time')
plt.ylabel('Time (seconds)')
plt.xticks(rotation=15)
plt.grid(axis='y')
plt.tight_layout()
plt.show()



<div style="border: 5px solid gold; padding: 15px; margin: 5px">
<b>   Reviewer's comment </b>


- You can compare the results with a constant baseline. For instance, you can take [DummyRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html). 


- After we train all models, it is recommended that we choose the best **one** and check its performance on the test subset. Here we only need to make predictions and calculate RMSE. For the final testing, where we use the test subset to check the model's generalization ability, we should use the best model (one model or two models if they have almost the same metric values). We don't use all models here because even just checking their performance influences our choices. This leads to test set leakage when we unconsciously start picking models that perform well on the test set, making it part of the training loop. In real-world scenarios, the test set is meant to reflect how the final model performs in the wild. In practice, you only deploy one model, not several models, so testing just that final one mirrors reality. Moreover, evaluating every tuned model on the test set (especially with big models or datasets) is expensive and time-consuming. 




- When choosing the best model, we have to consider prediction time as well. The best model isn't always the one with the lowest error. Sometimes the errors are only slightly different, but the prediction time varies significantly. In such cases, it's worth considering a faster model. Think of a slow search engine that finds 10 useful links versus a fast one that finds 9. This is especially important if the model needs to operate in real time and produce results repeatedly. If a program runs just once, its speed might not even matter. But if it’s used continuously, optimization becomes crucial. So, in practice, apart from the other requirements, there are also runtime constraints for the model.

</div>
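
One way to turn the accuracy-versus-speed trade-off into a rule is sketched below: among the models whose validation RMSE is within a tolerance of the best one, take the fastest predictor. The numbers are copied from the printed results table; the `pick_model` helper and the 5% tolerance are hypothetical and should be replaced by the real runtime requirements of the service.

```python
# (model, predict time in seconds, validation RMSE) from the printed results table
validation_results = [
    ("Linear Regression", 0.249876, 2475.580187),
    ("Decision Tree", 0.055528, 1976.326538),
    ("Random Forest", 0.788469, 1661.316088),
    ("LightGBM", 0.212684, 1612.084454),
    ("LightGBM (Tuned)", 0.593513, 1553.900028),
]


def pick_model(results, rmse_tolerance):
    """Return the fastest model whose RMSE is within the tolerance of the best RMSE."""
    best_rmse = min(rmse for _, _, rmse in results)
    limit = best_rmse * (1 + rmse_tolerance)
    candidates = [row for row in results if row[2] <= limit]
    return min(candidates, key=lambda row: row[1])[0]


print(pick_model(validation_results, rmse_tolerance=0.05))  # LightGBM
print(pick_model(validation_results, rmse_tolerance=0.0))   # LightGBM (Tuned)
```

With a 5% tolerance the untuned LightGBM wins because it predicts faster at a slightly higher error; with no tolerance the rule falls back to the lowest RMSE.
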

## Model Comparison Summary

We trained and evaluated several machine learning models to predict used car prices. To get a full picture, we included a simple baseline model (Dummy Regressor) and also tuned hyperparameters for LightGBM to see how far we could push its performance.

| Model               | Train Time (s)       | Predict Time (s) | Validation RMSE |
|--------------------|----------------------|------------------|-----------------|
| Linear Regression  | 5.897                | 0.164            | 2475.58         |
| Decision Tree      | 0.484                | 0.050            | 1976.33         |
| Random Forest      | 20.380               | 0.750            | 1661.32         |
| LightGBM           | 1.411                | 0.296            | 1612.08         |
| LightGBM (Tuned)   | (see GridSearch log) | 0.529            | **1553.90**     |
| Dummy Regressor    | 0.001                | 0.000            | 4588.07         |

### Final Test Results
- The best model was the **tuned LightGBM**, which achieved the lowest error.
- On the final test set, it produced an RMSE of **1605.29** and made predictions in just 0.27 seconds.

### Observations
- The **Dummy Regressor** provides a baseline using the mean of the training target values. All real models improved on this, as expected.
- **LightGBM** offered a strong balance of speed and accuracy, especially after tuning.
- **Random Forest** also performed well, but at a higher training cost.
- In practical scenarios, it’s not just accuracy that matters—**prediction speed** can be a major factor when deploying models in real-time systems.

We’ll now update the performance visualizations to reflect all six models.
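
The table lists the tuned model's training time as "(see GridSearch log)", which is not a number and therefore cannot be plotted alongside the other models. One possible way to record a numeric value is the `refit_time_` attribute that `GridSearchCV` sets when `refit` is enabled (the default). The sketch below uses synthetic data and a small decision tree as stand-ins; the data, the grid, and the variable names are assumptions, not the project's setup.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (invented for illustration)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={"max_depth": [2, 4]},
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X_demo, y_demo)

# Seconds spent refitting the best configuration on the full training data
tuned_train_time = search.refit_time_
print(f"Refit time of the best model: {tuned_train_time:.4f} s")
```

Note that `refit_time_` covers only the final refit of the best configuration, not the whole search, so it is comparable to the single-fit timings of the other models.
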



<div style="border: 5px solid green; padding: 10px; margin: 5px">
<b>   Reviewer's comment </b>
   
Great conclusion! This is a solid final summary with comparison across models.    
</div>    
<div style="border: 5px solid gold; padding: 10px; margin: 5px">
<b>   Reviewer's comment </b>
   
Try not to make your project (both text and code) look like AI-generated.
</div>    
<div style="border: 5px solid red; padding: 10px; margin: 5px">
<b>   Reviewer's comment </b>
    
Don't forget to update it if needed. 

</div>

# Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [x]  Code is error free
- [x]  The cells with the code have been arranged in order of execution
- [x]  The data has been downloaded and prepared
- [x]  The models have been trained
- [x]  The analysis of speed and quality of the models has been performed

## 🧾 Project Conclusion

We developed a machine learning model to predict the market value of used cars based on features like brand, mileage, fuel type, and engine power. After preparing the data and comparing multiple models, we found that:

- **Tuned LightGBM** offered the best overall performance.
- It had the lowest validation RMSE (~1553.90) and a strong final test RMSE (~1605.29).
- Prediction time was quick (~0.27 seconds), making it practical for real-world use.

To ensure fair evaluation, we:
- Removed outliers and duplicates, and replaced missing categorical values with `"Unknown"` instead of dropping rows.
- Split the dataset into training, validation, and test sets to prevent data leakage.
- Applied different encoding methods:
  - `OneHotEncoder` for linear models.
  - `OrdinalEncoder` for tree-based models.
  - Native category handling for LightGBM.

We also included a **Dummy Regressor** as a baseline to confirm that our models actually learned useful patterns.

In the end, LightGBM struck the best balance between accuracy and speed, making it well-suited for deployment in pricing systems.




<div style="border: 5px solid green; padding: 10px; margin: 5px">
<b>   Reviewer's comment Iter 2</b>
    
Congrats on such an excellent project, Taylor!

</div>