<div style="border-radius: 15px; border: 3px solid indigo; padding: 15px;">
<b> Reviewer's comment</b>
    
Hi, I am a reviewer on this project. Congratulations on submitting another project! 🎉
    

Before we start, I want to draw your attention to the color coding:
    

   
    
<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">

Great solutions and ideas that can and should be used in the future are in green comments. Some of them are: 
    
    
- You have successfully read the data;
    

    
- Used the correct way to encode categorical columns;    
       
    
    
- You have trained and compared several models, great!

    
    
- Measured their training speed, good! 

</div>
    
<div class="alert alert-warning" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<b> Reviewer's comment </b>

Yellow color indicates what should be optimized. This is not necessary, but it will be great if you make changes to this project. I've left several recommendations throughout the project. Please take a look.
 
</div>
<div class="alert alert-danger" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<b> Reviewer's comment </b>

Issues that must be corrected to achieve accurate results are indicated in red comments. Please note that the project cannot be accepted until these issues are resolved. For instance,



- Please try to explore the distributions and add conclusions; 

    
    
- Check the data for duplicates after you drop columns. 
    
    
    
- Please split the data first, and only then encode it.


    
- According to the task, we are supposed to measure models' prediction time as well. Would you update the code?


    
- Please add a conclusion about each model. How do they perform?

    

    
- We also need to tune hyperparameters. We tune them to identify the hyperparameters that will yield the desired metric value. Would you try to implement it?  



    
- At the very end of the project, choose the best model (the one that yielded the best RMSE) and run the final test. 


 
    
- Add the final conclusion please. A well-written conclusion shows how the project met its objectives and provides a concise and understandable summary for those who may not have been involved in the details of the project. 


There may be other issues that need your attention. I described everything in my comments.  
</div>        
<hr>
    
<font color='dodgerblue'>**To sum up:**</font> you did a great job here, thank you so much! The updates should not take much time. If you have any questions, please feel free to ask. I will be waiting for the project for the second review 😊 
    

<hr>
    
Please use some color other than those listed to highlight answers to my comments.
I would also ask you **not to change, move or delete my comments** to make it easier for me to navigate during the next review.
    
<hr> 
    
✍️ Here's a link to the [Supervised Learning documentation sections](https://scikit-learn.org/stable/supervised_learning.html) that you may find useful.
    
<hr>
    
    
📌 Please feel free to book a 1:1 session with our tutors or TAs [here](https://calendly.com/tripleten-ds-experts-team), join daily coworking sessions, or ask questions in the sprint channels on Discord if you need assistance 😉 
</div>

Rusty Bargain used car sales service is developing an app to attract new customers. In that app, you can quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build the model to determine the value. 

Rusty Bargain is interested in:

- the quality of the prediction;
- the speed of the prediction;
- the time required for training.

<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2>   Reviewer's comment </h2>
    
Good introduction! 
    
</div>
<div class="alert alert-warning" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2>   Reviewer's comment </h2>
    
Please don't forget about the project title :) A title should reflect the core goals.
    
</div>

## Data preparation

In [1]:
import pandas as pd
import numpy as np
import gc  # used by train_and_evaluate to trigger garbage collection
import time
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error

In [2]:
print("Loading data...")
data = pd.read_csv('/datasets/car_data.csv')

data = data.drop(columns=['NumberOfPictures', 'PostalCode'])

categorical_features = ['VehicleType', 'Gearbox', 'Model', 'FuelType', 'Brand', 'NotRepaired']
for col in categorical_features:
    data[col] = data[col].fillna('unknown')

data = data[(data['RegistrationYear'] >= 1900) & (data['RegistrationYear'] <= 2023)]

data = data.drop_duplicates()

target = data['Price']
features = data.drop(columns=['Price'])

features_full_train, features_test, target_full_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=12345)

features_train, features_valid, target_train, target_valid = train_test_split(
    features_full_train, target_full_train, test_size=0.25, random_state=12345)

ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
ohe.fit(features_train[categorical_features])

def encode(df):
    encoded = pd.DataFrame(ohe.transform(df[categorical_features]), index=df.index)
    encoded.columns = ohe.get_feature_names(categorical_features)
    df = df.drop(columns=categorical_features).reset_index(drop=True)
    encoded = encoded.reset_index(drop=True)
    return pd.concat([df, encoded], axis=1)

features_train = encode(features_train)
features_valid = encode(features_valid)
features_test = encode(features_test)

Loading data...


<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2> Reviewer's comment </h2>
    
Great choice! `OneHotEncoder(handle_unknown='ignore')` or `OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)` are generally more robust than `get_dummies` because they can handle situations where the test subset contains categories that were not seen during training. [Difference between OneHotEncoder and get_dummies](https://pythonsimplified.com/difference-between-onehotencoder-and-get_dummies/). 
    
    
    
For tree-based models, `OrdinalEncoder` is a better choice because of its lower computational cost. For boosting algorithms, we can rely on internal encoders that usually perform even better than external ones. For `CatBoost`, this is controlled by the `cat_features` parameter. For `LightGBM`, you can convert categorical features to the category type, allowing the model to handle them automatically.
    
    
    
Please note that `OrdinalEncoder()` should not be used with linear models if there's no ordinal relationship. [How and When to Use Ordinal Encoder](https://leochoi146.medium.com/how-and-when-to-use-ordinal-encoder-d8b0ef90c28c).
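
A minimal sketch of both options, assuming two toy data frames in place of the real split (the frames `train` and `test` and their values are hypothetical). It fits `OrdinalEncoder` on the training subset only and then shows the pandas `category` conversion that `LightGBM` can pick up:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames standing in for the real train/test subsets (hypothetical values).
train = pd.DataFrame({"Brand": ["audi", "bmw", "audi"],
                      "Gearbox": ["manual", "auto", "manual"]})
test = pd.DataFrame({"Brand": ["bmw", "volvo"],
                     "Gearbox": ["auto", "manual"]})
cat_cols = ["Brand", "Gearbox"]

# Option 1: ordinal codes. Fit on the training subset only;
# categories unseen during fitting are encoded as -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_ord = pd.DataFrame(encoder.fit_transform(train[cat_cols]),
                         columns=cat_cols, index=train.index)
test_ord = pd.DataFrame(encoder.transform(test[cat_cols]),
                        columns=cat_cols, index=test.index)
print(test_ord)

# Option 2: pandas 'category' dtype for LightGBM.
# Reuse the training categories so both subsets share the same codes;
# values unseen in training become missing.
train_cat = train.copy()
test_cat = test.copy()
for col in cat_cols:
    train_cat[col] = train_cat[col].astype("category")
    test_cat[col] = pd.Categorical(test_cat[col],
                                   categories=train_cat[col].cat.categories)
print(test_cat.dtypes)
```

In this sketch the brand `volvo` appears only in `test`, so it is encoded as `-1` in the first option and as a missing value in the second.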


</div><div class="alert alert-danger" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2> Reviewer's comment </h2>
    
- Let's not drop so many rows :) Instead, replace missing values with a unique placeholder value, such as "Unknown". 



- Are there any outliers in the data?  Please call the `describe` method and display charts. Drop abnormal values if they exist. 



- Please check the maximum values in the `RegistrationYear` and `DateCrawled` columns. A vehicle cannot be registered after the data was downloaded :) 



- There are other columns that can be dropped. After removing unnecessary columns, it makes sense to check the data for duplicates again, as the dataset will later be split into training and test sets. Removing specific columns can cause previously distinct rows to become identical. If a dropped column contained unique values (ID or timestamp), removing it may make multiple rows appear the same.   



- We should encode data after we split it to avoid data leakage. Would you fix it please? </div>
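
The checks requested in the comment above could be sketched roughly as follows. The toy frame, its values, and the date format are assumptions made so the example runs on its own; only the column names `DateCrawled`, `Price`, and `RegistrationYear` come from the project (`Power` is a hypothetical extra column).

```python
import pandas as pd

# Toy data standing in for car_data.csv (values and date format are hypothetical).
data = pd.DataFrame({
    "DateCrawled": ["2016-03-24 11:52", "2016-03-25 09:10", "2016-04-07 14:36"],
    "Price": [480, 480, 9800],
    "RegistrationYear": [1993, 1993, 2018],
    "Power": [0, 0, 190],
})

# Summary statistics help reveal suspicious minimum and maximum values.
print(data.describe())

# A vehicle cannot be registered after the listing was crawled,
# so the latest crawl year is an upper bound for RegistrationYear.
crawl_dates = pd.to_datetime(data["DateCrawled"])
max_year = crawl_dates.dt.year.max()
data = data[data["RegistrationYear"] <= max_year]

# Dropping a near-unique column can make distinct rows identical,
# so count duplicates after the drop, not before it.
data = data.drop(columns=["DateCrawled"])
n_duplicates = data.duplicated().sum()
data = data.drop_duplicates().reset_index(drop=True)

print(f"Upper bound for RegistrationYear: {max_year}")
print(f"Duplicates removed after dropping columns: {n_duplicates}")
```

Here the two 1993 rows differ only in their crawl timestamp, so they become duplicates once `DateCrawled` is dropped.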

## Model training

In [None]:
def train_and_evaluate(model, name):
    start_time = time.time()
    model.fit(features_train, target_train)
    train_time = time.time() - start_time
    
    predictions = model.predict(features_valid)
    rmse = mean_squared_error(target_valid, predictions, squared=False)
    
    print(f"{name} - RMSE: {rmse:.2f}, Training time: {train_time:.2f} sec")
    del model  # Clear model from memory after training
    gc.collect()  # Force garbage collection to free up memory

# Set up your models
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=12345),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=12345),
    "LightGBM": LGBMRegressor(n_estimators=10, random_state=12345)  # Reduce n_estimators for testing
}

# Make sure to only call models after you finish preprocessing
for name, model in models.items():
    train_and_evaluate(model, name)


<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2> Reviewer's comment </h2>
    
Well done! 

</div>

## Model analysis

In [None]:
print("\nModel Performance Summary:")
for name, model in models.items():
    start_time = time.time()
    model.fit(features_train, target_train)
    train_time = time.time() - start_time
    
    predictions = model.predict(features_valid)
    rmse = mean_squared_error(target_valid, predictions, squared=False)
    print(f"{name}: RMSE = {rmse:.2f}, Training time: {train_time:.2f} sec")



<div class="alert alert-success" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2> Reviewer's comment </h2>
    
Good! I'll run your code on my local machine, since the kernel here cannot handle heavy projects. 

</div>
<div class="alert alert-warning" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2> Reviewer's comment </h2>
    
You can use the function you defined above.
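
A minimal sketch of how such a function might also record prediction time. The name `evaluate`, the synthetic data from `make_regression`, and the keys of the returned dictionary are assumptions for illustration, not part of the project. RMSE is computed with `np.sqrt` so the sketch does not depend on the `squared` argument.

```python
import time

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def evaluate(model, X_train, y_train, X_valid, y_valid):
    """Fit one model and return its RMSE, training time and prediction time (seconds)."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_valid)
    predict_time = time.perf_counter() - start

    rmse = float(np.sqrt(mean_squared_error(y_valid, predictions)))
    return {"rmse": rmse, "train_time": train_time, "predict_time": predict_time}


# Synthetic data so the sketch runs on its own.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=12345),
}

# One call per model; the results are kept for the later comparison.
results = {name: evaluate(model, X_train, y_train, X_valid, y_valid)
           for name, model in models.items()}

for name, r in results.items():
    print(f"{name}: RMSE = {r['rmse']:.2f}, "
          f"train = {r['train_time']:.4f} s, predict = {r['predict_time']:.4f} s")
```

Returning a dictionary instead of printing inside the function keeps all three measurements available for a summary table or a conclusion.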

</div>
<div class="alert alert-danger" style="border-radius: 15px; box-shadow: 4px 4px 4px; border: 1px solid ">
<h2> Reviewer's comment </h2>
    
- According to the task, we also need to estimate the prediction time. Would you try?


- Please try to tune hyperparameters for at least one of the models. If you decide to use a loop, don't forget to change the way you split the data, because in this case we will need three subsets, not two. 
    
    
- It will be perfect if you add a conclusion about each model. How do they perform?

    
- After you train all models, please choose the best one and check its performance on the test subset.




- Let's add the overall conclusion to your project: what has been done and what can be inferred from the results. 

</div>
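
One possible shape for the tuning loop and the final test, shown on synthetic data. The choice of `DecisionTreeRegressor`, the `max_depth` grid, and the helper `rmse` are assumptions for illustration; the three-way split mirrors the 60/20/20 proportions used earlier in the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data so the sketch runs on its own.
X, y = make_regression(n_samples=600, n_features=8, noise=15.0, random_state=12345)

# Three subsets: train (60%), validation (20%), test (20%).
X_full_train, X_test, y_full_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_full_train, y_full_train, test_size=0.25, random_state=12345)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Tune max_depth on the validation subset only.
best_depth, best_rmse, best_model = None, float("inf"), None
for depth in range(2, 13, 2):
    model = DecisionTreeRegressor(max_depth=depth, random_state=12345)
    model.fit(X_train, y_train)
    valid_rmse = rmse(y_valid, model.predict(X_valid))
    if valid_rmse < best_rmse:
        best_depth, best_rmse, best_model = depth, valid_rmse, model

# The test subset is used once, for the selected model.
test_rmse = rmse(y_test, best_model.predict(X_test))
print(f"Best max_depth: {best_depth}, validation RMSE: {best_rmse:.2f}")
print(f"Test RMSE of the selected model: {test_rmse:.2f}")
```

Keeping the test subset out of the loop means the final RMSE is measured on data that played no part in choosing the hyperparameters.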

# Checklist

Type 'x' to check. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [ ]  Code is error free
- [ ]  The cells with the code have been arranged in order of execution
- [ ]  The data has been downloaded and prepared
- [ ]  The models have been trained
- [ ]  The analysis of speed and quality of the models has been performed

<div class="alert alert-info">
  Hey, if the code has an error, I'm sorry, but I keep getting a dead kernel.
</div>