## Creating, training, and analyzing a Light Gradient Boosting Machine (LightGBM) model

### Importing libraries and loading the dataset

*1. Importing Libraries:* Essential Python libraries for data manipulation, mathematical operations, and machine learning are imported. These include pandas for data handling, numpy for numerical operations, various utilities from scikit-learn for model selection and metrics, and lightgbm for gradient boosted machine learning models.

*2. Loading the Dataset:* The dataset is loaded from a CSV file named 'final_data.csv' into a pandas DataFrame. The dataset has been preprocessed beforehand and is ready for machine learning modeling.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
import lightgbm as lgb
from lightgbm.callback import early_stopping

# Load the dataset
df = pd.read_csv('final_data.csv')

### Modeling

In the cell below we create a LightGBM machine learning model that aims to predict the price of a vehicle from various features. Here's a breakdown of what each section of the code does:

*1. Feature Selection:* The code starts by selecting the features (variables) to be used for predicting the vehicle price. These features include characteristics like body type, fuel economy, engine type, and more. The 'price' of the vehicle is set as the target variable that the model will predict.

*2. Data Preprocessing:* The features are divided into two types: categorical (like body type, engine type) and numerical (like mileage, horsepower). Categorical features are processed using One-Hot Encoding, which replaces each categorical column with one 0/1 indicator column per category so that the model receives purely numeric input. The numerical features are passed through without changes. Note that `exterior_color` and `model_name` appear in the numerical list, which only works if they were already encoded as numbers during the earlier preprocessing.

*3. Pipeline Creation:* A pipeline for preprocessing categorical features is created. This pipeline applies One-Hot Encoding to these features, with `handle_unknown='ignore'` so that categories not seen during fitting do not raise an error. A `ColumnTransformer` combines this pipeline with a passthrough for the numerical features, which receive no transformation.

*4. Data Splitting:* The dataset is split into three parts: training (60%), validation (20%), and testing (20%). The split is done in two stages: 40% of the data is first held out, and that portion is then divided in half. This is a common practice in machine learning, allowing the model to be trained on one set of data, tuned on another, and finally tested on unseen data.

*5. Preprocessing Application:* The preprocessor is fitted on the training data only and then applied to the training, validation, and test data, so that no information from the held-out sets influences the encoding.

*6. Model Definition and Training:* The code defines a LightGBM regression model. LightGBM is a gradient boosting framework that uses tree-based learning algorithms. The model is trained on the preprocessed training data with early stopping: training halts if the validation loss has not improved for 150 consecutive boosting rounds. In the run shown below the stopping condition was never met, so training ran through the configured 21485 rounds, and the log reports iteration 21457 as the best one.

*7. Model Prediction:* Finally, the model predicts the 'price' for the test data. Because the test set played no part in training or early stopping, these predictions show how the model behaves on data it has never seen before.

In summary, this code represents the process of preparing data, training a machine learning model, and then using that model to make predictions. The goal is to accurately predict vehicle prices based on various attributes of the vehicles.
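
To make the One-Hot Encoding step more concrete, here is a minimal pure-Python sketch of what the encoding does to a single categorical column, including the "ignore unknown categories" behaviour. It is only an illustration and not how scikit-learn implements `OneHotEncoder`; the helper names `fit_categories` and `one_hot` and the fuel type values are hypothetical.

```python
def fit_categories(values):
    """Collect the distinct categories seen in the training data, in sorted order."""
    return sorted(set(values))


def one_hot(value, categories):
    """Encode one value as a 0/1 vector; a category not seen in training becomes all zeros."""
    return [1 if value == category else 0 for category in categories]


# Hypothetical training column (not taken from final_data.csv)
train_fuel_type = ["Gasoline", "Diesel", "Gasoline", "Hybrid"]

categories = fit_categories(train_fuel_type)
encoded_train = [one_hot(value, categories) for value in train_fuel_type]

# A category that only shows up after fitting, e.g. in the validation or test data
encoded_unseen = one_hot("Electric", categories)

print(categories)
print(encoded_train)
print(encoded_unseen)
```

Each category becomes its own indicator column, which is why the eight categorical features in this notebook expand into many more model inputs than the original column count.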

In [3]:
# Selecting the desired features for the model and the target variable
X = df[['body_type', 'city_fuel_economy', 'engine_type', 'exterior_color', 'fuel_tank_volume', 'fuel_type', 'highway_fuel_economy', 'horsepower', 'isCab', 'make_name', 'maximum_seating', 'mileage', 'model_name', 'seller_rating', 'torque', 'transmission', 'wheel_system', 'year', 'damage_history', 'major_options_count'
]]  # Features
y = df['price']  # Target variable

# Listing the categorical features (one-hot encoded) and the numerical features (passed through)
categorical_features = ['body_type', 'engine_type', 'damage_history', 'fuel_type', 'isCab', 'make_name', 'transmission', 'wheel_system']
numerical_features = ['city_fuel_economy', 'highway_fuel_economy', 'exterior_color', 'fuel_tank_volume', 'horsepower', 'mileage', 'model_name', 'major_options_count', 'seller_rating', 'torque', 'year' ]

# Creating preprocessing pipelines for categorical features
categorical_pipeline = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # Applies One-Hot Encoding
])

# No transformation for numerical features in this pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_pipeline, categorical_features),
        ('num', 'passthrough', numerical_features)  # No changes to numerical features
    ]
)

# Splitting the data into training, validation, and testing sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Fit the preprocessor on the training data only, then apply it to the validation and test data
X_train_preprocessed = preprocessor.fit_transform(X_train)
X_val_preprocessed = preprocessor.transform(X_val)
X_test_preprocessed = preprocessor.transform(X_test)

# Define the LightGBM model
model = lgb.LGBMRegressor(
    objective='regression', 
    n_estimators=21485,
    learning_rate=0.037,
    num_leaves=180, 
    max_depth=-1,
    n_jobs=7, 
    random_state=42,
    min_child_samples=6,
    #subsample=0.8,
    #colsample_bytree=0.8,
    #force_row_wise=True
    #force_col_wise=True
)

# Fit the model with early stopping using callback
model.fit(
    X_train_preprocessed, y_train, 
    eval_set=[(X_val_preprocessed, y_val)], 
    callbacks=[early_stopping(stopping_rounds=150, verbose=True)]
)

# Predicting the 'price' for the test data
y_pred = model.predict(X_test_preprocessed)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.032671 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1907
[LightGBM] [Info] Number of data points in the train set: 1409224, number of used features: 108
[LightGBM] [Info] Start training from score 30691.202544
Training until validation scores don't improve for 150 rounds
Did not meet early stopping. Best iteration is:
[21457]	valid_0's l2: 8.25886e+06
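
The log above says that early stopping was not met. The following is a simplified sketch of a patience-based stopping rule, meant only to illustrate why that can happen; it is not LightGBM's implementation, and the function name `best_iteration_with_patience` and the loss values are hypothetical.

```python
def best_iteration_with_patience(val_losses, patience):
    """Scan validation losses (one per boosting round, 1-based) and return
    (best_iteration, stopped_early) under a simple patience rule."""
    best_loss = float("inf")
    best_iteration = 0
    for iteration, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best validation loss: remember where it happened
            best_loss = loss
            best_iteration = iteration
        elif iteration - best_iteration >= patience:
            # No improvement for `patience` rounds in a row: stop
            return best_iteration, True
    # Ran out of rounds before the patience was exhausted
    return best_iteration, False


# Loss stops improving after round 3, so training stops once patience runs out
plateau = best_iteration_with_patience([10.0, 8.0, 7.0, 7.5, 7.2, 7.1, 7.3], patience=3)

# Loss keeps improving until the last round, so the rule never triggers
still_improving = best_iteration_with_patience([10.0, 9.0, 8.0, 7.0, 6.0], patience=3)

print(plateau)
print(still_improving)
```

The second case resembles the run above: the best iteration (21457) lies within 150 rounds of the configured maximum (21485), so the validation loss was still improving occasionally when the round limit was reached.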


### Evaluation of performance

Here we evaluate the performance of the machine learning model developed for predicting vehicle prices. The cell calculates and displays two key metrics: the Root Mean Squared Error (RMSE) and the R² Score.

*1. Calculating RMSE:* RMSE (Root Mean Squared Error) is a standard measure of error for models that predict numeric values. It is the square root of the average of the squared differences between the predicted values and the actual values. Because it is expressed in the same unit as the target, it can be read as a typical size of the prediction error in price units, with large errors weighted more heavily than small ones. A lower RMSE value indicates a better fit.

*2. Calculating R² Score:* The R² Score, or the coefficient of determination, is the proportion of the variance in the dependent variable (vehicle price, in this case) that is explained by the model. A score of 1 means the predictions match the actual values exactly, a score of 0 means the model does no better than always predicting the mean price, and the score can be negative for a model that does worse than that. R² therefore summarizes goodness of fit relative to that baseline; it does not state the size of the errors, which is what RMSE is for.

*3. Printing Performance Metrics:* Finally, the code prints out these two metrics - RMSE and R² Score. This is an important step for understanding how well the model performs. By reviewing these metrics, one can assess the accuracy and reliability of the model's predictions.
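
As a small worked example of the two definitions, the sketch below computes RMSE and R² by hand on three made-up prices. The functions `rmse` and `r2` and the numbers are illustrative only; the notebook itself relies on scikit-learn's `mean_squared_error` and `r2_score`.

```python
import math


def rmse(actual, predicted):
    """Square root of the mean squared difference between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2(actual, predicted):
    """1 minus the ratio of residual variation to total variation around the mean."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical prices, not taken from the dataset
actual = [10000.0, 20000.0, 30000.0]
predicted = [11000.0, 19000.0, 32000.0]

example_rmse = rmse(actual, predicted)
example_r2 = r2(actual, predicted)

# Baseline: always predicting the mean price gives an R² of exactly 0
baseline_r2 = r2(actual, [20000.0, 20000.0, 20000.0])

print(example_rmse, example_r2, baseline_r2)
```

Here the errors are 1000, 1000, and 2000, giving an RMSE of about 1414 and an R² of 0.97, while the mean-only baseline scores 0.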


In [4]:
# Calculating the performance of the predictions
rmse = np.sqrt(mean_squared_error(y_test,y_pred))
r2 = r2_score(y_test, y_pred)

# Printing performance metrics
print(f'Root Mean Squared Error: {rmse}')
print(f'R^2 Score: {r2}')

Root Mean Squared Error: 2889.584650229301
R^2 Score: 0.9679221238931831
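
To put these numbers in context, the sketch below converts the validation loss from the training log (an L2 loss, i.e. mean squared error) into an RMSE and compares it with the test RMSE. It also relates the test RMSE to the "Start training from score" value in the log, on the assumption that this value is the mean price of the training set. All three inputs are copied from the outputs in this notebook; the variable names are my own.

```python
import math

# Values copied from the training log and the evaluation output above
val_l2 = 8.25886e6                  # valid_0's l2 at the best iteration
test_rmse = 2889.584650229301       # RMSE on the test set
start_score = 30691.202544          # assumed to be the mean training price

# The L2 metric is a mean squared error, so its square root is an RMSE
val_rmse = math.sqrt(val_l2)

# Difference between test and validation error
gap = test_rmse - val_rmse

# Test RMSE relative to the (assumed) average price
relative_rmse = test_rmse / start_score

print(f"Validation RMSE: {val_rmse:.1f}")
print(f"Test minus validation RMSE: {gap:.1f}")
print(f"Test RMSE as a share of the mean price: {relative_rmse:.1%}")
```

The validation and test RMSE are within about 16 price units of each other, which suggests the model performs similarly on both held-out sets.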


### Comparing predictions to actual prices

This part of the code is focused on comparing the predicted vehicle prices from the machine learning model with the actual prices from the dataset.

*1. Creating a Comparison DataFrame:* The code starts by creating a new DataFrame named comparison_df. This DataFrame contains two columns: 'Actual Price' (the real price of the vehicles) and 'Predicted Price' (the price predicted by the model). This direct comparison allows for an immediate visual assessment of the model's performance.

*2. Calculating Price Differences:* The code then adds two new columns to comparison_df. The first new column, 'Difference', is calculated as the predicted price minus the actual price, so it is a signed error: positive values mean the model overestimated the price and negative values mean it underestimated it. The second column, 'Difference%', is the absolute value of that difference as a percentage of the actual price. This percentage is a relative measure of the prediction error, giving context to how significant the differences are in comparison to the actual prices.

*3. Generating a Random Sample:* To get a more manageable view of the data, the code generates a random sample of 15 entries from the comparison DataFrame. Because `random_state=None` is used, a different sample is drawn every time the cell runs. This sample provides a snapshot of the model's prediction accuracy across different data points without overwhelming the user with the entire dataset.

*4. Sorting the Data:* The sample is then sorted by 'Difference%' in descending order. This sorting brings the predictions with the biggest relative errors to the top, which makes it easy to see where, within the sample, the model's predictions were most and least accurate.

*5. Resetting the Index:* For better readability, the index of the DataFrame is reset. This step makes it easier to reference and read the rows.

*6. Displaying the Results:* Finally, this random sample is printed out. This display is useful for a quick, random inspection of how well the model's predictions align with actual prices in different scenarios.

In summary, this part of the code gives a practical, row-level view of the model's accuracy in predicting vehicle prices. Where RMSE and R² summarize the error over the whole test set, the randomly sampled rows show what individual prediction errors look like in both absolute and relative terms.
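
The two difference columns can be illustrated for a single vehicle with a short sketch. The helper `price_errors` and the prices are hypothetical; the point is only that 'Difference' keeps its sign while 'Difference%' does not.

```python
def price_errors(actual, predicted):
    """Return the signed difference and the absolute percentage difference."""
    difference = predicted - actual
    difference_pct = abs(difference / actual * 100)
    return difference, difference_pct


# Hypothetical vehicle priced at 20000, once overestimated and once underestimated
over = price_errors(actual=20000.0, predicted=21000.0)
under = price_errors(actual=20000.0, predicted=19000.0)

print(over)
print(under)
```

Both predictions are off by 5% of the actual price, but only the signed difference tells the overestimate (+1000) apart from the underestimate (-1000).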

In [317]:
# Comparing predicted prices with actual prices
comparison_df = pd.DataFrame({'Actual Price': y_test, 'Predicted Price': y_pred})
comparison_df['Difference'] = comparison_df['Predicted Price'] - comparison_df['Actual Price']
comparison_df['Difference%'] = np.abs(comparison_df['Difference'] / comparison_df['Actual Price'] * 100)

# Generate a random sample from the comparison dataframe
random_comparison_sample = comparison_df.sample(n=15, random_state=None)

# Sort the random sample by 'Difference%' in descending order for display
random_comparison_sample_sorted = random_comparison_sample.sort_values(by='Difference%', ascending=False)

# Reset index for the sorted sample for better readability
random_comparison_sample_sorted.reset_index(drop=True, inplace=True)

# Display the sorted random sample
print(random_comparison_sample_sorted)

    Actual Price  Predicted Price    Difference  Difference%
0         8499.0     11746.785354   3247.785354    38.213735
1        55953.0     65974.312593  10021.312593    17.910233
2        17900.0     15348.817642  -2551.182358    14.252415
3        22941.0     20783.969853  -2157.030147     9.402511
4        51852.0     47647.115102  -4204.884898     8.109398
5        19500.0     18192.817189  -1307.182811     6.703502
6        61362.0     57663.064630  -3698.935370     6.028055
7        49406.0     46776.914103  -2629.085897     5.321390
8        30802.0     32198.098746   1396.098746     4.532494
9        21192.0     22142.873143    950.873143     4.486944
10       45055.0     44203.803455   -851.196545     1.889239
11       17408.0     17226.484485   -181.515515     1.042713
12       21995.0     22110.855509    115.855509     0.526736
13       16795.0     16847.902602     52.902602     0.314990
14       90255.0     90128.016585   -126.983415     0.140694
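
As a rough way to read the sample above, the sketch below summarizes its 'Difference%' column with the standard library. The values are copied from the printed output; since only 15 rows are shown and the sample changes on every run, the summary describes this particular sample and should not be taken as an estimate for the whole test set.

```python
import statistics

# 'Difference%' values copied from the 15 sampled rows printed above
difference_pct = [
    38.213735, 17.910233, 14.252415, 9.402511, 8.109398,
    6.703502, 6.028055, 5.321390, 4.532494, 4.486944,
    1.889239, 1.042713, 0.526736, 0.314990, 0.140694,
]

mean_pct = statistics.mean(difference_pct)
median_pct = statistics.median(difference_pct)

# Number of sampled predictions that are within 10% of the actual price
within_10 = sum(1 for value in difference_pct if value <= 10)

print(f"Mean absolute percentage difference: {mean_pct:.2f}%")
print(f"Median absolute percentage difference: {median_pct:.2f}%")
print(f"Predictions within 10%: {within_10} of {len(difference_pct)}")
```

In this sample the mean (about 7.9%) is noticeably higher than the median (about 5.3%) because a single 38% error on a low-priced vehicle pulls it up.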
