# Modeling Stage Report in the CRISP-DM Model

* Author: Aleksandr Baranov
* Date: 9.12.2023

The Modeling stage is the fourth step in the CRISP-DM (Cross-Industry Standard Process for Data Mining) Model. This stage involves the development and evaluation of models to address the business problem. Here is a summary of the key activities:

## Selection of Modeling Methods:
The choice of modeling methods depends on the nature of the problem and the type of data. Some common methods include:

* Linear Regression: Suitable for predicting numeric values based on linear relationships between variables.

* Logistic Regression: Applied for classification tasks, especially binary classification.

* Decision Trees and Random Forests: Suitable for classification and regression tasks, capable of capturing non-linear relationships.

* Support Vector Machines (SVM): Effective in separating data into classes.

* Neural Networks: Useful for complex tasks requiring the learning of intricate patterns.

* LightGBM: A gradient boosting algorithm that excels in regression and classification tasks.

We tested random forest, linear regression, CatBoost, XGBoost, and LightGBM models; of these, LightGBM performed best for our task.
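
A comparison of this kind can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, only two scikit-learn models are included, and the `candidates` dictionary and validation RMSE criterion are hypothetical choices, not the project's actual comparison code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 500 rows, 4 numeric features.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate models (hypothetical selection). CatBoost, XGBoost, and
# LightGBM regressors could be added to this dict in the same way.
candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Fit each candidate and record its RMSE on the validation set.
validation_rmse = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_val)
    validation_rmse[name] = float(np.sqrt(mean_squared_error(y_val, predictions)))

# The candidate with the lowest validation RMSE is selected.
best_name = min(validation_rmse, key=validation_rmse.get)
print(validation_rmse, "best:", best_name)
```

Comparing all candidates on the same held-out data keeps the ranking fair, since every model is scored on rows it did not see during fitting.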

LightGBM has the following characteristics:

#### Regression Tasks:

LightGBM is well suited to predicting numeric values because it can capture complex nonlinear relationships between variables. This makes it a strong choice for tasks such as estimating automobile prices from vehicle features.

#### Classification Tasks:

LightGBM also performs exceptionally well in classification tasks, including binary and multiclass scenarios. Its ability to handle large datasets and identify intricate patterns makes it effective for tasks like classifying types of vehicles or determining product categories.

#### Efficiency and Speed:

LightGBM is known for its efficiency and speed. It is a tree-based gradient boosting framework that uses histogram-based split finding and leaf-wise tree growth, which lets it learn efficiently from large volumes of data while achieving high accuracy.

#### Handling Categorical Features:

The algorithm supports categorical features, making it convenient for working with datasets that contain various types of variables.

#### Flexible Parameter Tuning:

LightGBM allows for tuning various hyperparameters, such as the number of trees, learning rate, and tree depth, making it flexible for different scenarios and tasks.

In summary, LightGBM is a powerful machine learning algorithm successfully applied to various regression and classification tasks, particularly when efficiency and the ability to handle large datasets are crucial.

## Data Splitting:

The data were divided into three sets: the training set, the validation set, and the test set. 

#### Training Set:

X_train and y_train contain data used for training the model. The model learns the relationships between input features (X_train) and the target variable (y_train).

#### Temporary Set:

X_temp and y_temp serve as a temporary dataset that will be further split into the validation and test sets.

#### Validation and Test Sets:

* X_val and y_val form the validation set, used to assess the model's performance during training and hyperparameter tuning.
* X_test and y_test form the test set, which remains "frozen" until the end of the process and is used for the final evaluation of the model's performance.

#### Splitting Procedure:

train_test_split is used twice: first to split the entire dataset (X and y) into the training and temporary sets (60% and 40%), and then to split the temporary set evenly into the validation and test sets (50% and 50%). The result is an overall split of 60% training, 20% validation, and 20% test data.
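
The two-stage split can be sketched as below. This is a minimal illustration on synthetic data; the array shapes and the `random_state` value passed to the splitter are assumptions, while the ratios and variable names follow the description above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and target (assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)

# First split: 60% training, 40% temporary.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42
)

# Second split: the temporary set is halved into validation and test sets,
# each holding 20% of the original data.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```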

Dividing the data this way allows the model to be evaluated and tuned on the validation set, which reduces the risk of overfitting to the training data, while the untouched test set provides an unbiased assessment of performance on new data.

## Model Building:

Features and the target variable are selected from the dataset (X and y). The features include both categorical and numerical variables related to vehicles.
Categorical features are prepared for the model using one-hot encoding, creating a preprocessing pipeline (categorical_pipeline).
A ColumnTransformer (preprocessor) is set up to handle both categorical and numerical features separately.
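
A preprocessing setup along these lines might look like the sketch below. The column names and sample rows are hypothetical, and passing the numerical columns through unchanged is an assumption, since the report does not state how they are handled.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical vehicle data; the real dataset's columns are not shown here.
X = pd.DataFrame(
    {
        "brand": ["audi", "bmw", "audi", "ford"],
        "fuel_type": ["petrol", "diesel", "diesel", "petrol"],
        "mileage": [50000, 120000, 80000, 30000],
        "year": [2018, 2012, 2015, 2020],
    }
)
categorical_features = ["brand", "fuel_type"]
numerical_features = ["mileage", "year"]

# One-hot encode categorical columns; categories unseen during fitting
# are encoded as all zeros instead of raising an error.
categorical_pipeline = Pipeline(
    steps=[("onehot", OneHotEncoder(handle_unknown="ignore"))]
)

# Apply the categorical pipeline to categorical columns and pass numerical
# columns through unchanged (assumption).
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", categorical_pipeline, categorical_features),
        ("num", "passthrough", numerical_features),
    ]
)

X_preprocessed = preprocessor.fit_transform(X)
print(X_preprocessed.shape)  # 3 brands + 2 fuel types + 2 numeric columns
```

In practice the preprocessor would be fitted on the training set only and then applied to the validation and test sets with `transform`, so that no information leaks from held-out data.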

#### Model Parameters:

* objective: 'regression' - Indicates that the model is designed for regression tasks.
* n_estimators: 21485 - The maximum number of boosting rounds (trees) to build; early stopping may halt training sooner.
* learning_rate: 0.037 - The shrinkage rate applied to each tree's contribution; smaller values help prevent overfitting.
* num_leaves: 180 - Maximum number of leaves in one tree.
* max_depth: -1 - No limit on the depth of the tree.
* n_jobs: 7 - The number of parallel threads used for training.
* random_state: 42 - Seed for reproducibility.
* min_child_samples: 6 - Minimum number of data points in a leaf.

## Model Training:
The model is trained using the fit method, where the preprocessed training data (X_train_preprocessed and y_train) are used. Early stopping is implemented with a callback that monitors the validation set (X_val_preprocessed and y_val). The training process will stop if there is no improvement in the evaluation metric (default is mean squared error) on the validation set for 150 consecutive rounds.

This approach helps prevent overfitting by halting training once performance on the validation set stops improving. Training progress is logged while early stopping monitors the validation metric.

## Model Evaluation:

Training progress information is provided, including the number of rounds and the best iteration based on validation set performance.
Model evaluation metrics such as Root Mean Squared Error (RMSE) and R-squared (R^2) are calculated using the test set to assess the model's predictive performance.
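
As a worked illustration of the two metrics, the sketch below computes RMSE and R^2 directly from their definitions. The four price values are made up for the example and are not project data.

```python
import numpy as np

# Made-up actual and predicted prices (assumption, for illustration only).
y_test = np.array([10000.0, 20000.0, 30000.0, 40000.0])
y_pred = np.array([11000.0, 19000.0, 32000.0, 38000.0])

residuals = y_test - y_pred

# RMSE: square root of the mean squared residual, in the units of the target.
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2: one minus the ratio of residual variation to total variation.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse:.2f}, R^2: {r2:.3f}")  # RMSE: 1581.14, R^2: 0.980
```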

## Selection of the Best Model:

The calculated performance metrics (RMSE and R^2) provide insights into how well the model predicts prices.
The model's predictions are compared to the actual prices, and a DataFrame (comparison_df) is created to analyze differences and percentages of differences.
The DataFrame is sorted by the difference percentage so that predictions with the largest discrepancies can be inspected, and a random sample (random_comparison_sample) is drawn from it for spot-checking.
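
A comparison table of this kind could be built as in the sketch below. The column names, the sign convention for the difference, and the price values are assumptions; only the names comparison_df and random_comparison_sample come from the text above.

```python
import pandas as pd

# Made-up actual and predicted prices (assumption, for illustration only).
actual = [10000.0, 20000.0, 30000.0, 40000.0]
predicted = [11000.0, 19000.0, 32000.0, 38000.0]

comparison_df = pd.DataFrame({"actual_price": actual, "predicted_price": predicted})

# Signed difference and its absolute size as a percentage of the actual price.
comparison_df["difference"] = (
    comparison_df["predicted_price"] - comparison_df["actual_price"]
)
comparison_df["difference_pct"] = (
    comparison_df["difference"].abs() / comparison_df["actual_price"] * 100
)

# Largest percentage discrepancies first.
comparison_df = comparison_df.sort_values("difference_pct", ascending=False)

# Random sample of rows for spot-checking.
random_comparison_sample = comparison_df.sample(n=2, random_state=42)

print(comparison_df)
```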

## Results:

The RMSE is approximately 2889.58. It measures the typical size of the prediction error in the same units as the price, with large errors weighted more heavily than small ones.
The R^2 score is approximately 0.968, meaning the model explains about 96.8% of the variance in price on the test set.
The comparison DataFrame shows a sample of actual prices, predicted prices, differences, and difference percentages, providing insights into the model's accuracy.
Together, these results summarize the model's predictive capabilities and indicate where it might need improvement.

## Conclusions:

The Modeling stage utilizing LightGBM has concluded successfully. The model exhibited high predictive accuracy, making it suitable for further deployment in solving the specified business problem.