<div style="border: 5px solid purple; padding: 15px; margin: 5px">
<b> Reviewer's comment 3</b>


Thank you for the updates! You can find my new comments with digit 3. You did a great job here! You learned how to build and evaluate models to predict used car prices. You have successfully conducted EDA, handled missing values and outliers. You trained and tuned several models, compared their RMSE and speed, and chose the best model for the final testing. You learned how to prepare and encode large data and how to weigh training speed vs. prediction error, and why this matters in real-world applications. I hope you enjoyed this project! I do not have any questions, so the project can be accepted. Good luck! 

    
</div>

<div style="border: 5px solid purple; padding: 15px; margin: 5px">
<b> Reviewer's comment 2</b>


Thank you for submitting the project! I appreciate the time you took to update it!  I've left a couple of new comments with digit 2. Would you please take a look? 

    
</div>

<div style="border: 5px solid purple; padding: 15px; margin: 5px">
<b> Reviewer's comment</b>
    
Hi Joshua! Congratulations on submitting another project! 🎉 I will be using the standard color marking:
    

   
    
<div style="border: 5px solid green; padding: 15px; margin: 5px">

Great solutions and ideas that can and should be used in the future are in green comments. Some of them are: 
    
    
- You have successfully prepared the subsets. It is important to split the data correctly in order to ensure there's no intersection;    
    

    
- Encoded categorical columns;    
 
    
- Trained and compared several models, great!

    
- Measured their training and prediction speed.
   


- Analyzed metrics. It is not enough to just fit the model and print the result. Instead, we have to analyze the results as it helps us identify what can be improved;

</div>
    
<div style="border: 5px solid gold; padding: 15px; margin: 5px">
<b> Reviewer's comment </b>

Yellow color indicates what should be optimized. This is not necessary, but it will be great if you make changes to this project. I've left several recommendations throughout the project. Please take a look.
 
</div>
<div style="border: 5px solid red; padding: 15px; margin: 5px">
<b> Reviewer's comment </b>

Issues that must be corrected to achieve accurate results are indicated in red comments. Please note that the project cannot be accepted until these issues are resolved. For instance,


- Please try to explore the distributions and add conclusions. In real-world problems, the data is rarely clean. Displaying distributions helps us evaluate the data, find outliers, identify the required preprocessing steps and understand feature relationships, which informs feature engineering. In some cases, feature engineering is the key; 

    
- There are several columns that can also be dropped. Please consider removing them to reduce computational cost;
    

    
- Check the data for the duplicates after you drop columns; 
    
    
- Please split the data first and only then scale or encode it, to avoid data leakage;
  
    
    
- Please note that we are solving a regression task.


    
- We also need to tune hyperparameters. We tune them to identify the hyperparameters that will yield the desired metric value. Would you try to implement it?  


    
- At the very end of the project, choose the best model (the one that yielded the best RMSE and good prediction speed, or two models if they have the same metric values) and run the final test;

      


There may be other issues that need your attention. I described everything in my comments.  
</div>        
<hr>
    
<font color='dodgerblue'>**To sum up:**</font> great job here! You demonstrated strong analytical and modeling skills by preparing the data, experimenting with multiple advanced models, and evaluating them with appropriate metrics. The conclusion clearly communicates which model offers the best trade-off between speed and RMSE. There are just several issues that need your attention. Please take a look at my comments and do not hesitate to ask questions if some of them seem unclear. I will be waiting for the project for the second review 😊 
    

<hr>
    
Please use some color other than those listed to highlight answers to my comments.
I would also ask you **not to change, move or delete my comments** to make it easier for me to navigate during the next review.
    
<hr> 
    
✍️ Here's a link to [Supervised Learning documentation sections](https://scikit-learn.org/stable/supervised_learning.html) that you may find useful.
    
<hr>
    
    
📌 Please feel free to book a 1:1 session with our tutors or TAs [here](https://calendly.com/tripleten-ds-experts-team), join daily coworking sessions, or ask questions in the sprint channels on Discord if you need assistance 😉 
</div>

Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training

# Import Libraries 

In [None]:
# Standard library
import time

# Data handling and plotting
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Models, preprocessing and evaluation (regression only)
from catboost import CatBoostRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import OrdinalEncoder
from xgboost import XGBRegressor

## Data preparation

In [None]:
#Load the data and check for obvious issues 
df = pd.read_csv('/datasets/car_data.csv')

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
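# Fill missing values with "Unknown" for categorical columns only
cat_cols = df.select_dtypes(include='object').columns
df_filled = df.copy()
df_filled[cat_cols] = df_filled[cat_cols].fillna("Unknown")

# Display basic statistics
describe_df = df_filled.describe(include='all')
display("Dataset Description", describe_df)

# Plot histograms for 'Price' and 'Power'
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.histplot(df_filled['Price'], bins=100, kde=True)
plt.title('Price Distribution')
plt.xlabel('Price')

plt.subplot(1, 2, 2)
sns.histplot(df_filled['Power'], bins=100)
plt.title('Power Distribution')
plt.xlabel('Power')

plt.tight_layout()
plt.show()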

In [None]:
# Check for missing values and data types
missing_summary = df.isnull().sum()
data_types = df.dtypes

# Combine both into a summary DataFrame
summary = pd.DataFrame({
    "Missing Values": missing_summary,
    "Data Type": data_types
}).sort_values(by="Missing Values", ascending=False)

display("Missing Values and Data Types Summary",summary)

In [None]:
# Define reasonable limits for 'Price' and 'Power' based on domain knowledge
price_min, price_max = 100, 150000  # Remove extremely low or high prices
power_min, power_max = 10, 1000     # Remove implausible power values

# Filter out the outliers
df_no_outliers = df_filled[
    (df_filled['Price'].astype(float).between(price_min, price_max)) &
    (df_filled['Power'].astype(float).between(power_min, power_max))
]

# Drop unnecessary columns
columns_to_drop = ['LastSeen', 'DateCreated', 'RegistrationMonth', 'PostalCode', 'NumberOfPictures','DateCrawled']
df_reduced = df_no_outliers.drop(columns=columns_to_drop)

# Check for duplicates after column removal
duplicates_count = df_reduced.duplicated().sum()

df_reduced_cleaned = df_reduced.drop_duplicates()

{
    "Initial Rows": len(df),
    "Rows After Outlier Removal": len(df_no_outliers),
    "Rows After Removing Unnecessary Columns and Duplicates": len(df_reduced_cleaned),
    "Duplicates Removed": duplicates_count
}
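
The limits above are set by hand. As a minimal sketch of how they could be backed up with numbers, the snippet below inspects quantiles and measures how many rows each limit would remove. The toy values are made up for illustration; only the column names `Price` and `Power` and the limits come from the cell above, and `share_outside` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "Price": [0, 50, 900, 1500, 3200, 4800, 7500, 12000, 19999, 500000],
    "Power": [0, 5, 60, 75, 90, 105, 140, 190, 300, 20000],
})


def share_outside(series, low, high):
    """Return the fraction of values that fall outside [low, high]."""
    return float((~series.between(low, high)).mean())


# Quantiles show where the bulk of the data sits before any limit is chosen.
print(toy[["Price", "Power"]].quantile([0.01, 0.25, 0.5, 0.75, 0.99]))

# Fraction of rows that the hand-picked limits would remove.
price_removed = share_outside(toy["Price"], 100, 150000)
power_removed = share_outside(toy["Power"], 10, 1000)
print(f"Price rows outside limits: {price_removed:.0%}")
print(f"Power rows outside limits: {power_removed:.0%}")
```

Running the same check on `df_filled` would show how much of the real data each threshold discards, which is a quick way to justify the cut-offs to a reader.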


<div style="border: 5px solid green; padding: 15px; margin: 5px">
<b>   Reviewer's comment 2 </b>
    
Excellent job in this section! 

</div>

<div style="border: 5px solid green; padding: 15px; margin: 5px">
<b>   Reviewer's comment </b>
    
The data was successfully read, well done! 
    
</div><div style="border: 5px solid red; padding: 15px; margin: 5px">
<b> Reviewer's comment</b>
    
- Let's not drop so many rows :) Instead, replace missing values with some unique value, such as "Unknown". 


- Are there any outliers in the data?  Please call the `describe` method and display charts. Drop abnormal values if they exist. Hint: `price` and `power` columns definitely have outliers.



   
- There are columns that should be deleted to reduce computational cost. These are:  `LastSeen`, `DateCreated`, `RegistrationMonth`, `PostalCode` and `NumberOfPictures`. 
    
    
 
- After removing unnecessary columns, it makes sense to check the data for duplicates again, as the dataset will later be split into training and test sets. Removing specific columns can cause previously distinct rows to become identical. If a dropped column contained unique values (ID or timestamp), removing it may make multiple rows appear the same.

</div>
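
The point about duplicates appearing after columns are dropped can be illustrated with a tiny made-up example. The rows below are hypothetical; they only reuse column names from the dataset.

```python
import pandas as pd

# Hypothetical rows: the first two listings differ only in the crawl timestamp.
toy = pd.DataFrame({
    "DateCrawled": ["2016-03-24 11:52", "2016-03-25 09:10", "2016-03-26 14:00"],
    "Brand": ["volkswagen", "volkswagen", "audi"],
    "Power": [75, 75, 190],
    "Price": [1500, 1500, 18300],
})

# With the timestamp present, every row is unique.
dups_before = int(toy.duplicated().sum())

# Once the timestamp is dropped, the first two rows become identical.
reduced = toy.drop(columns=["DateCrawled"])
dups_after = int(reduced.duplicated().sum())

deduped = reduced.drop_duplicates().reset_index(drop=True)
print(dups_before, dups_after, len(deduped))
```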
<div style="border: 5px solid gold; padding: 15px; margin: 5px">
<b>   Reviewer's comment </b>



- You can then drop `RegistrationYear`. It will significantly simplify the training process.

  
    
- It will be perfect if you analyze the distributions and display the charts, thus showing a reader why you decided to delete specific rows.



- Consider comparing max dates in the `RegistrationYear` and  `DateCrawled` columns. A vehicle should not be registered after the data was downloaded :) 


- Consider analyzing categories as well. Petrol and gasoline refer to the same fuel, so we can use one of these categories. There are some other rare fuel types that can be dropped. If a category appears only in the training or validation subset, for instance, and we use `handle_unknown='ignore'`, the linear model might miss important signals in validation or make predictions with incomplete features thus breaking the assumptions of linearity. It may be helpful to make sure that training and validation subsets use the same feature columns after encoding. 


- Another option is to drop `VehicleType` and `Brand`, since we have `Model` that should reflect both. 
</div>
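
Two of the suggestions above, merging equivalent fuel categories and checking registration years against the crawl date, could be sketched as follows. The data is made up, the `min_count` threshold is arbitrary, and rare categories are grouped into a hypothetical `other` bucket instead of being dropped, which is an assumption rather than the reviewer's exact recommendation.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "FuelType": ["petrol", "gasoline", "gasoline", "lpg", "petrol", "electric"],
    "RegistrationYear": [2005, 2011, 2019, 1998, 2016, 2010],
    "DateCrawled": ["2016-03-24 11:52:17"] * 5 + ["2016-04-07 14:36:58"],
})

# Treat 'gasoline' and 'petrol' as one category.
toy["FuelType"] = toy["FuelType"].replace({"gasoline": "petrol"})

# Group categories seen fewer than `min_count` times into 'other'.
min_count = 2  # arbitrary threshold for this sketch
counts = toy["FuelType"].value_counts()
rare = counts[counts < min_count].index
toy["FuelType"] = toy["FuelType"].where(~toy["FuelType"].isin(rare), "other")

# A car cannot be registered after its listing was crawled.
crawl_year = pd.to_datetime(toy["DateCrawled"]).dt.year
valid = toy[toy["RegistrationYear"] <= crawl_year]

print(toy["FuelType"].value_counts())
print(len(toy), "rows before,", len(valid), "rows after the year check")
```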

## Model training

In [None]:
# Split the data again for regression
X_full = df_reduced_cleaned.drop('Price', axis=1)
y_full = df_reduced_cleaned['Price']

# Split into train+valid and test
X_train_valid, X_test, y_train_valid, y_test = train_test_split(X_full, y_full, test_size=0.20, random_state=42)

# Step 2: Split train+valid into train and validation subsets (75% train, 25% valid)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_valid, y_train_valid, test_size=0.25, random_state=42)

# Show sizes of each subset (printed explicitly, since a bare dict in the middle of a cell is not displayed)
print({
    "Total Samples": len(df_reduced_cleaned),
    "Train Samples": len(X_train),
    "Valid Samples": len(X_valid),
    "Test Samples": len(X_test)
})

# Step 3: Encode categorical features only on train, valid, and test subsets
def encode_with_ordinal(train_df, valid_df, test_df):
    enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
    cat_cols = train_df.select_dtypes(include='object').columns

    train_df = train_df.copy()
    valid_df = valid_df.copy()
    test_df = test_df.copy()

    train_df[cat_cols] = enc.fit_transform(train_df[cat_cols])
    valid_df[cat_cols] = enc.transform(valid_df[cat_cols])
    test_df[cat_cols] = enc.transform(test_df[cat_cols])

    return train_df, valid_df, test_df, enc  # use train encoders on test set

# Apply optimized encoding
X_train_enc, X_valid_enc, X_test_enc, encoder = encode_with_ordinal(X_train, X_valid, X_test)

# Step 4: Train regression models and evaluate RMSE
models = {
    "Random Forest": RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42),
    "Linear Regression": LinearRegression(),
    "CatBoost": CatBoostRegressor(iterations=100, learning_rate=0.1, depth=6, verbose=0, random_seed=42),
    "XGBoost": XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=6, random_state=42)
}
rmse_results = {}
for name, model in models.items():
    model.fit(X_train_enc, y_train)
    preds = model.predict(X_valid_enc)
    rmse = np.sqrt(mean_squared_error(y_valid, preds))  # RMSE; avoids the deprecated `squared` argument
    rmse_results[name] = rmse

# Sort results
rmse_results_sorted = dict(sorted(rmse_results.items(), key=lambda x: x[1]))
rmse_results_sorted

<div style="border: 5px solid green; padding: 15px; margin: 5px">
<b>   Reviewer's comment </b>
    
Yes, we need to encode data here, well done! 
    
</div>
<div style="border: 5px solid gold; padding: 15px; margin: 5px">
<b>   Reviewer's comment </b>
    
`OrdinalEncoder()` or `LabelEncoder()` should not be used with linear models if there's no ordinal relationship. [How and When to Use Ordinal Encoder](https://leochoi146.medium.com/how-and-when-to-use-ordinal-encoder-d8b0ef90c28c). For linear regression, I recommend using `OneHotEncoder(handle_unknown='ignore')`. 



</div>
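
A minimal sketch of the one-hot recommendation above, assuming a scikit-learn `Pipeline` with a `ColumnTransformer`. The toy frames and the `linear_pipeline` name are made up for illustration. The encoder is fitted on the training rows only, and a brand that appears only in validation (`tesla` here) is encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; column names mirror the real dataset.
train = pd.DataFrame({
    "Brand": ["audi", "bmw", "audi", "opel", "bmw", "opel"],
    "Power": [150, 180, 140, 75, 200, 90],
})
y_train = pd.Series([12000, 15000, 11000, 3000, 17000, 4000])
valid = pd.DataFrame({"Brand": ["audi", "tesla"], "Power": [145, 250]})

# One-hot encode the categorical column, pass numeric columns through unchanged.
preprocess = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["Brand"])],
    remainder="passthrough",
)
linear_pipeline = Pipeline([("prep", preprocess), ("model", LinearRegression())])

# Fitting the pipeline learns the categories from the training data only.
linear_pipeline.fit(train, y_train)
preds = linear_pipeline.predict(valid)

# Validation rows get the same columns as training: 3 brands + Power.
encoded_valid = linear_pipeline.named_steps["prep"].transform(valid)
print(encoded_valid.shape, preds)
```

Keeping the encoder inside the pipeline also means the same fitted transformation is reused for the test subset, so training and evaluation always share one set of feature columns.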

<div style="border: 5px solid red; padding: 15px; margin: 5px">
<b> Reviewer's comment</b>


- We are solving a regression task here, and our metric is RMSE.  

  
    
- We have to encode data after we split it, not before.


    
- Please split the data into three subsets, as we need to save at least one subset for the final testing. You only need to introduce the test subset here that you will use on the final test. 
</div>
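
For reference, RMSE is the square root of the mean squared difference between true and predicted prices, so it is expressed in the same units as `Price`. A small NumPy sketch with made-up numbers (the `rmse` helper is hypothetical, not part of the project code):

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Errors are 500, 0 and -1000, so the mean squared error is 1,250,000 / 3.
example_rmse = rmse([3000, 5000, 7000], [2500, 5000, 8000])
print(round(example_rmse, 2))
```

Taking the square root of `mean_squared_error` gives the same value and does not rely on the `squared` argument, which has been deprecated in recent scikit-learn releases.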

In [None]:
# Evaluate models for speed and quality
performance_regression = []

for name, model in models.items():
    start_time = time.time()
    model.fit(X_train_enc, y_train)
    train_time = time.time() - start_time

    preds_valid = model.predict(X_valid_enc)
    rmse = np.sqrt(mean_squared_error(y_valid, preds_valid))  # RMSE; avoids the deprecated `squared` argument

    performance_regression.append({
        "Model": name,
        "RMSE": round(rmse, 2),
        "Training Time (s)": round(train_time, 4)
    })

# Convert to DataFrame and display sorted by RMSE
performance_df_regression = pd.DataFrame(performance_regression).sort_values(by="RMSE")

display("Model Speed and Quality (Regression)",performance_df_regression)


In [None]:
prediction_speeds = []

# Fit and time predictions
for name, model in models.items():
    model.fit(X_train_enc, y_train)
    start_time = time.time()
    _ = model.predict(X_valid_enc)
    prediction_time = time.time() - start_time
    prediction_speeds.append({
        "Model": name,
        "Prediction Time (s)": round(prediction_time, 4)
    })

# Display results
prediction_speed_df = pd.DataFrame(prediction_speeds).sort_values(by="Prediction Time (s)")

display("Prediction Speed Comparison",prediction_speed_df)

<div style="border: 5px solid green; padding: 15px; margin: 5px">
<b>   Reviewer's comment 3 </b>
    
Good! You can also display them both :)
</div><div style="border: 5px solid green; padding: 15px; margin: 5px">
<b>   Reviewer's comment 2 </b>
    
Great! 
</div>
<div style="border: 5px solid red; padding: 15px; margin: 5px">
<b>   Reviewer's comment 2 </b>
    
However, we are also asked to calculate the prediction speed. Would you please add it? 
</div>
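
One way to show training time, prediction time and RMSE in a single table, as suggested above, is to time both steps in the same loop so each model is fitted only once. This is a sketch on synthetic data with two scikit-learn models standing in for the project's models; `benchmark` is a hypothetical helper.

```python
import time

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption; the project uses the car dataset).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=500)
X_tr, X_va, y_tr, y_va = X[:400], X[400:], y[:400], y[400:]


def benchmark(name, model, X_tr, y_tr, X_va, y_va):
    """Fit once, then report RMSE together with fit and predict wall-clock times."""
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    train_time = time.perf_counter() - start

    start = time.perf_counter()
    preds = model.predict(X_va)
    predict_time = time.perf_counter() - start

    rmse = float(np.sqrt(np.mean((y_va - preds) ** 2)))
    return {
        "Model": name,
        "RMSE": round(rmse, 4),
        "Training Time (s)": round(train_time, 4),
        "Prediction Time (s)": round(predict_time, 4),
    }


candidates = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=42),
}
report = pd.DataFrame(
    [benchmark(name, model, X_tr, y_tr, X_va, y_va) for name, model in candidates.items()]
).sort_values(by="RMSE")
print(report)
```

`time.perf_counter()` is used here because it is intended for measuring short durations; single measurements are still noisy, so repeating the timing a few times gives a steadier picture.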

In [None]:
# Combine train and valid sets
X_train_full = pd.concat([X_train_enc, X_valid_enc])
y_train_full = pd.concat([y_train, y_valid])

# Step 1: Train default XGBoost
default_xgb = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=6, random_state=42)
default_xgb.fit(X_train_full, y_train_full)
default_preds = default_xgb.predict(X_test_enc)
default_rmse = np.sqrt(mean_squared_error(y_test, default_preds))  # RMSE; avoids the deprecated `squared` argument

# Step 2: Hyperparameter tuning
param_dist = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'reg_alpha': [0, 0.1, 1],
    'reg_lambda': [1, 1.5, 2]
}

random_search = RandomizedSearchCV(
    estimator=XGBRegressor(random_state=42, n_jobs=-1),
    param_distributions=param_dist,
    n_iter=20,
    scoring='neg_root_mean_squared_error',
    cv=3,
    verbose=1,
    random_state=42,
    n_jobs=-1
)

random_search.fit(X_train_enc, y_train)
best_xgb = random_search.best_estimator_
best_xgb.fit(X_train_full, y_train_full)
tuned_preds = best_xgb.predict(X_test_enc)
tuned_rmse = np.sqrt(mean_squared_error(y_test, tuned_preds))  # RMSE; avoids the deprecated `squared` argument

# Create a results table
results = pd.DataFrame([
    {"Model": "Default XGBoost", "Test RMSE": round(default_rmse, 2)},
    {"Model": "Tuned XGBoost", "Test RMSE": round(tuned_rmse, 2)}
])

# Sort by RMSE
results = results.sort_values(by="Test RMSE")

# Display the comparison and the winner
print("Test RMSE Comparison:")
print(results)
print("Best Model:", "Default XGBoost" if default_rmse < tuned_rmse else "Tuned XGBoost")

<div style="border: 5px solid green; padding: 15px; margin: 5px">
<b>   Reviewer's comment 3 </b>
    
Makes sense! I had to run your project on my local machine because of the Kernel issues.
</div>

<div style="border: 5px solid red; padding: 15px; margin: 5px">
<b>   Reviewer's comment 2 </b>
    
For the final testing, we need to choose the best model among all models, not the best hyperparameters. XGBoost and tuned XGBoost are two different models. Moreover, it is possible that the default hyperparameter values (XGBoost model) perform better. So please compare their RMSE first and then choose the best model :) 
</div>
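
One possible reading of this comment is to compare the default and tuned models on the validation subset and touch the test subset only once, with the winner. The sketch below assumes synthetic data and uses `DecisionTreeRegressor` as a lightweight stand-in for XGBoost; the parameter grid is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data split into train / validation / test (60% / 20% / 20%).
X, y = make_regression(n_samples=600, n_features=8, noise=10.0, random_state=42)
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=42)


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


# Candidate 1: default hyperparameters.
default_model = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)

# Candidate 2: hyperparameters tuned with cross-validation on the training subset.
search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_distributions={"max_depth": [3, 5, 8, None], "min_samples_leaf": [1, 5, 10, 20]},
    n_iter=8,
    scoring="neg_root_mean_squared_error",
    cv=3,
    random_state=42,
)
search.fit(X_train, y_train)
tuned_model = search.best_estimator_

# Compare both candidates on the validation subset only.
valid_scores = {
    "default": rmse(y_valid, default_model.predict(X_valid)),
    "tuned": rmse(y_valid, tuned_model.predict(X_valid)),
}
best_name = min(valid_scores, key=valid_scores.get)
best_model = {"default": default_model, "tuned": tuned_model}[best_name]

# Refit the winner on train + validation and evaluate once on the test subset.
best_model.fit(np.vstack([X_train, X_valid]), np.concatenate([y_train, y_valid]))
test_rmse = rmse(y_test, best_model.predict(X_test))
print(valid_scores, best_name, round(test_rmse, 2))
```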

<div style="border: 5px solid green; padding: 15px; margin: 5px">
<b>   Reviewer's comment </b>
    
Good! 

</div><div style="border: 5px solid red; padding: 15px; margin: 5px">
<b> Reviewer's comment</b>
    

- Let's not repeat the code.

  

- Please try to tune hyperparameters for at least one of the models except for Linear Regression. For this purpose, use `RandomizedSearchCV` or `GridSearchCV`. 
    
    
- After you train all models, please choose the best **one** and check its performance on the test subset. Here we only need to make predictions and calculate RMSE. For the final testing, where we use the test subset to check the model's generalization ability, we should use the best model (one model or two models if they have almost the same metric values). We don't use all models here because even just checking their performance influences our choices. This leads to test set leakage when we unconsciously start picking models that perform well on the test set, making it part of the training loop. In real-world scenarios, the test set is meant to reflect how the final model performs in the wild. In practice, you only deploy one model, not several models, so testing just that final one mirrors reality. Moreover, evaluating every tuned model on the test set (especially with big models or datasets) is expensive and time-consuming. 




- When choosing the best model, please consider prediction time as well. The best model isn't always the one with the lowest error. Sometimes the errors are only slightly different, but the prediction time varies significantly. In such cases, it's worth considering a faster model. Think of a slow search engine that finds 10 useful links versus a fast one that finds 9. This is especially important if the model needs to operate in real time and produce results repeatedly. If a program runs just once, its speed might not even matter. But if it’s used continuously, optimization becomes crucial. So, in practice, apart from the other requirements, there are also runtime constraints for the model.</div>
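
The trade-off described above can be turned into a simple selection rule: among models whose RMSE is within a small tolerance of the best one, prefer the fastest at prediction time. The sketch below is one such rule; `pick_model`, the 2% tolerance and all numbers are hypothetical and are not results from this project.

```python
def pick_model(results, rmse_tolerance=0.02):
    """Among models within `rmse_tolerance` (relative) of the best RMSE,
    return the one with the shortest prediction time."""
    best_rmse = min(r["rmse"] for r in results)
    close_enough = [r for r in results if r["rmse"] <= best_rmse * (1 + rmse_tolerance)]
    return min(close_enough, key=lambda r: r["predict_s"])


# Made-up numbers for illustration only.
results = [
    {"model": "A", "rmse": 1700.0, "predict_s": 0.90},
    {"model": "B", "rmse": 1720.0, "predict_s": 0.05},
    {"model": "C", "rmse": 2900.0, "predict_s": 0.01},
]

chosen = pick_model(results)
print(chosen["model"])
```

With these numbers, model B is selected: its error is only slightly above A's, but it predicts much faster, while C is fast but too inaccurate to qualify. Setting `rmse_tolerance=0.0` falls back to choosing purely by RMSE.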

## Model analysis

### Final Test Evaluation

The tuned XGBoost model was retrained on the combined training and validation sets.

On the held-out test set, it achieved a test RMSE of ~1724.24, which suggests that the model generalizes well to unseen data.

### Conclusion

XGBoost emerged as the best model in terms of accuracy and generalizability.

Hyperparameter tuning further optimized its performance, which shows the value of model refinement.

This model is well-suited for deployment in real-world pricing applications, offering both precision and scalability.

<div style="border: 5px solid green; padding: 15px; margin: 5px">
<b>   Reviewer's comment 3 </b>
    
Good results! 
</div>

<div style="border: 5px solid green; padding: 15px; margin: 5px">
<b>   Reviewer's comment </b>


    
Great conclusion! This is a solid final summary with comparison across models. 
    
    
</div>    
 
<div style="border: 5px solid red; padding: 15px; margin: 5px">
<b>   Reviewer's comment </b>
    
Don't forget to update it if needed. 

</div>

# Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [ ]  Code is error free
- [ ]  The cells with the code have been arranged in order of execution
- [ ]  The data has been downloaded and prepared
- [ ]  The models have been trained
- [ ]  The analysis of speed and quality of the models has been performed