# Applying Machine Learning Models

In [16]:
import pandas as pd

train_file_path = "Updated_TrainingData.xlsx"  
test_file_path = "Updated_TestData.xlsx" 
train_data = pd.read_excel(train_file_path)
test_data = pd.read_excel(test_file_path)

X_train = train_data.drop(columns=['Sourcing Cost'])  
y_train = train_data['Sourcing Cost']  

X_test = test_data.drop(columns=['Sourcing Cost'])  
y_test = test_data['Sourcing Cost']  

print("Training dataset shape:", X_train.shape)
print("Test dataset shape:", X_test.shape)

Training dataset shape: (528493, 8)
Test dataset shape: (96, 8)


### RandomForestRegressor Model:

- **High Accuracy:** RandomForestRegressor can capture non-linear relationships and feature interactions without manual feature engineering, which often gives it strong accuracy on tabular data with several features.

- **Robust to Overfitting:** Because it averages the predictions of many decision trees, each trained on a bootstrap sample of the data, RandomForestRegressor is less prone to overfitting than a single decision tree and tends to generalize better.

- **Feature Importance:** After fitting, RandomForestRegressor exposes feature importance scores (`feature_importances_`), which show which features contribute most to the predictions and help with feature selection and interpretation.
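
The averaging idea behind the second point can be sketched in plain Python. This is a toy illustration, not scikit-learn's implementation: the helper names (`fit_stump`, `fit_bagged_stumps`, `predict_bagged`) and the data are hypothetical, each "tree" is a single split, and a real random forest also subsamples features at each split and grows much deeper trees.

```python
import random
import statistics


def fit_stump(xs, ys):
    """Fit a one-split regression tree (a "stump") by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = statistics.fmean(left), statistics.fmean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x is identical: fall back to the overall mean
        overall_mean = statistics.fmean(ys)
        return lambda x: overall_mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_bagged_stumps(xs, ys, n_estimators=25, seed=0):
    """Train each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_estimators):
        rows = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in rows], [ys[i] for i in rows]))
    return stumps


def predict_bagged(stumps, x):
    """Average the individual predictions, as a regression forest does."""
    return statistics.fmean([stump(x) for stump in stumps])


# Hypothetical toy data: a noisy step from about 1.0 to about 3.0 at x = 2
data_rng = random.Random(42)
xs = [i / 10 for i in range(40)]
ys = [(1.0 if x < 2 else 3.0) + data_rng.gauss(0, 0.3) for x in xs]

stumps = fit_bagged_stumps(xs, ys)
print(predict_bagged(stumps, 0.5), predict_bagged(stumps, 3.5))
```

Each stump sees a slightly different resample, so their individual errors differ; averaging them smooths out the noise that any single tree would fit.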


In [18]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rf_model_tuned = RandomForestRegressor(n_estimators=100, max_depth=10, min_samples_split=5, min_samples_leaf=2)

rf_model_tuned.fit(X_train, y_train)

rf_preds_tuned = rf_model_tuned.predict(X_test)

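# Take the square root of the MSE explicitly; the `squared=False` argument is deprecated in newer scikit-learn releases
rf_rmse_tuned = mean_squared_error(y_test, rf_preds_tuned) ** 0.5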
print("Tuned Random Forest RMSE:", rf_rmse_tuned)

Tuned Random Forest RMSE: 90.03733081700386
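
Every model in this notebook is scored with RMSE, the square root of the mean squared difference between actual and predicted values. As a minimal sketch of what that metric computes, here is a dependency-free version; the function name `rmse` and the sample numbers are illustrative, not part of the notebook.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values: errors of 3 and 4 give sqrt((9 + 16) / 2)
example_rmse = rmse([0.0, 0.0], [3.0, 4.0])
print(example_rmse)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, and the result is in the same units as the target (here, sourcing cost).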




### XGBRegressor Model:

- **Excellent Performance:** XGBoost is a gradient-boosted tree library with a strong track record for both training speed and predictive accuracy on tabular data, which makes it a natural candidate for this dataset.

- **Regularization Techniques:** XGBoost has built-in L1 and L2 regularization (the `reg_alpha` and `reg_lambda` parameters), which penalizes large leaf weights to help prevent overfitting and improve generalization.

- **Feature Importance:** A fitted XGBRegressor provides feature importance scores, which identify the most influential features in the dataset and help interpret the model's predictions.
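
To make the regularization point concrete, the sketch below computes the weight of a single tree leaf under L1 and L2 penalties. It follows my understanding of the leaf-weight formula in XGBoost's objective (soft-threshold the summed gradient by the L1 term, then divide by the summed hessian plus the L2 term); the function `leaf_weight` is a hypothetical helper written for illustration, not a call into the library.

```python
def leaf_weight(gradients, hessians, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal weight of one leaf given per-row gradients and hessians.

    reg_alpha (L1) soft-thresholds the summed gradient toward zero;
    reg_lambda (L2) inflates the denominator, shrinking the weight.
    """
    grad_sum = sum(gradients)
    hess_sum = sum(hessians)
    if grad_sum > reg_alpha:
        grad_sum -= reg_alpha
    elif grad_sum < -reg_alpha:
        grad_sum += reg_alpha
    else:
        grad_sum = 0.0
    return -grad_sum / (hess_sum + reg_lambda)


# Squared-error loss 0.5 * (pred - y)^2: gradient = pred - y, hessian = 1.
# Four rows whose current prediction is 2.0 too low, so each gradient is -2.0.
gradients = [-2.0, -2.0, -2.0, -2.0]
hessians = [1.0, 1.0, 1.0, 1.0]

for lam in (0.0, 1.0, 4.0):
    print("reg_lambda =", lam, "->", leaf_weight(gradients, hessians, reg_lambda=lam))
print("reg_alpha = 2.0 ->", leaf_weight(gradients, hessians, reg_lambda=0.0, reg_alpha=2.0))
```

Without any penalty the leaf corrects the full residual of 2.0; larger penalties pull the correction toward zero, which is how the regularization limits how far any single leaf can move the prediction.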


In [17]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

xgb_model = XGBRegressor()

xgb_model.fit(X_train, y_train)

xgb_preds = xgb_model.predict(X_test)

# Explicit square root of the MSE; `squared=False` is deprecated in newer scikit-learn releases
xgb_rmse = mean_squared_error(y_test, xgb_preds) ** 0.5
print("XGBoost RMSE:", xgb_rmse)

XGBoost RMSE: 84.35457812098922




### LGBMRegressor Model:

1. **Efficient Training Speed**: LGBMRegressor trains quickly, which matters here because the training set has more than 500,000 rows.

2. **High Accuracy**: Despite its speed, LGBMRegressor often achieves accuracy comparable to other gradient boosting implementations on tabular prediction tasks.

3. **Handling of Categorical Features**: LGBMRegressor can split on categorical features directly, without one-hot encoding, provided the columns are marked as categorical (for example with the pandas `category` dtype). This simplifies preprocessing.
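
One reason usually given for LightGBM's training speed is histogram-based split finding: feature values are bucketed into a small number of bins, and candidate splits are evaluated only at bin boundaries. The sketch below illustrates that idea in plain Python under simplifying assumptions (equal-width bins, one feature, squared-error gain); `bin_feature` and `best_split_from_histogram` are hypothetical helpers, not LightGBM functions.

```python
def bin_feature(values, n_bins):
    """Map each value to an equal-width bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for a constant feature
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def best_split_from_histogram(bins, targets, n_bins):
    """Scan bin boundaries once and return (last_left_bin, gain) of the best split."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for b, t in zip(bins, targets):
        sums[b] += t
        counts[b] += 1
    total_sum, total_count = sum(sums), sum(counts)

    best_bin, best_gain = None, 0.0
    left_sum, left_count = 0.0, 0
    for b in range(n_bins - 1):
        left_sum += sums[b]
        left_count += counts[b]
        right_count = total_count - left_count
        if left_count == 0 or right_count == 0:
            continue
        right_sum = total_sum - left_sum
        # Reduction in squared error from splitting after bin b
        gain = (left_sum ** 2 / left_count
                + right_sum ** 2 / right_count
                - total_sum ** 2 / total_count)
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain


# Hypothetical data: the target jumps from 1.0 to 5.0 when the feature reaches 50
values = list(range(100))
targets = [1.0 if v < 50 else 5.0 for v in values]

bins = bin_feature(values, n_bins=10)
split_bin, split_gain = best_split_from_histogram(bins, targets, n_bins=10)
print(split_bin, split_gain)
```

The scan visits 9 boundaries instead of 99 distinct thresholds, and the per-bin sums are built in a single pass over the rows, which is where the time savings on large datasets come from.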


In [19]:
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error

lgbm_model = LGBMRegressor()

lgbm_model.fit(X_train, y_train)

lgbm_preds = lgbm_model.predict(X_test)

# Explicit square root of the MSE; `squared=False` is deprecated in newer scikit-learn releases
lgbm_rmse = mean_squared_error(y_test, lgbm_preds) ** 0.5
print("LightGBM RMSE:", lgbm_rmse)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.010914 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 81
[LightGBM] [Info] Number of data points in the train set: 528493, number of used features: 8
[LightGBM] [Info] Start training from score 110.035407
LightGBM RMSE: 83.91287828231486




### VAR (Vector Auto Regression) Model:

1. **Multivariate Forecasting**: A VAR model forecasts several time series jointly, which makes it suitable for datasets whose variables are correlated with one another and observed in time order.

2. **Capturing Dynamic Interactions**: Each variable is modeled as a linear function of the lagged values of all variables, so VAR captures interactions over time that univariate time series models miss.

3. **Flexible Forecasting Horizon**: A fitted VAR model can forecast any number of steps ahead, which is valuable for planning and decision-making.
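
The mechanics described in the list above can be sketched with NumPy for the simplest case, a VAR with one lag and no intercept. This is an illustration under those assumptions, not what `statsmodels` does internally; `fit_var1`, `forecast_var1`, and the coefficient matrix `true_A` are hypothetical. Note that a VAR only forecasts the variables it is fitted on, so the target has to be one of the columns of the input series.

```python
import numpy as np


def fit_var1(series):
    """Least-squares fit of y_t = A @ y_{t-1} (lag order 1, no intercept)."""
    past, present = series[:-1], series[1:]
    # Solves past @ X = present, so X is the transpose of A
    solution, *_ = np.linalg.lstsq(past, present, rcond=None)
    return solution.T


def forecast_var1(A, last_obs, steps):
    """Iterate the fitted recursion forward; returns an array of shape (steps, k)."""
    current = np.asarray(last_obs, dtype=float)
    preds = []
    for _ in range(steps):
        current = A @ current
        preds.append(current)
    return np.vstack(preds)


# Hypothetical two-variable system; column 0 plays the role of the target
true_A = np.array([[0.5, 0.1],
                   [0.0, 0.8]])
rows = [np.array([1.0, 2.0])]
for _ in range(29):
    rows.append(true_A @ rows[-1])
series = np.vstack(rows)

A_hat = fit_var1(series)
forecast = forecast_var1(A_hat, series[-1], steps=3)
target_forecast = forecast[:, 0]  # valid only because the target is column 0 of `series`
print(A_hat)
print(target_forecast)
```

The off-diagonal entry of the fitted matrix is what lets the second variable's past influence the forecast of the first, which is the "dynamic interaction" a univariate model cannot represent.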


In [20]:
from statsmodels.tsa.vector_ar.var_model import VAR
from sklearn.metrics import mean_squared_error

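# Caveat: X_train excludes 'Sourcing Cost', so this VAR is fit on the eight feature columns only
var_model = VAR(endog=X_train)

var_model_fitted = var_model.fit()

var_preds = var_model_fitted.forecast(y=X_train.values[-var_model_fitted.k_ar:], steps=len(X_test))

# Caveat: column 0 is therefore the forecast of the first feature, not of 'Sourcing Cost';
# to forecast the target, it would have to be included in `endog` and selected by its column position
var_sourcing_cost_preds = var_preds[:, 0]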

# Explicit square root of the MSE; `squared=False` is deprecated in newer scikit-learn releases
var_rmse = mean_squared_error(y_test, var_sourcing_cost_preds) ** 0.5
print("Vector Auto Regression RMSE:", var_rmse)

Vector Auto Regression RMSE: 52.22280401960267




### Linear Regression Model:

1. **Simplicity and Transparency**: Linear Regression is simple and transparent: each coefficient gives the expected change in the target for a one-unit change in a feature, with the other features held fixed. This makes the relationship between features and target easy to interpret.

2. **Efficiency with Large Datasets**: Linear Regression fits quickly even on large datasets, so the 528,493 training rows used here add little computational overhead.

3. **Baseline Performance**: Linear Regression provides a baseline against which more complex models can be compared, and serves as a starting point for model selection and evaluation.
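
As a small illustration of the transparency point, the sketch below fits an ordinary least-squares model with NumPy and reads the coefficients back. The data is synthetic and noiseless, and the names `true_coef` and `true_intercept` are assumptions made for the example, so the fit recovers them exactly; real data would only approximate them.

```python
import numpy as np

# Synthetic, noiseless data: y = 10 + 2*x0 - 1*x1 + 0.5*x2 (illustrative values)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_coef = np.array([2.0, -1.0, 0.5])
true_intercept = 10.0
y = X @ true_coef + true_intercept

# Prepend a column of ones so the first fitted parameter is the intercept
design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, coef = beta[0], beta[1:]

print("intercept:", intercept)
for i, c in enumerate(coef):
    print(f"feature {i}: a one-unit increase changes the prediction by {c:+.2f}")
```

In scikit-learn the same quantities are available on a fitted `LinearRegression` as `intercept_` and `coef_`.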


In [24]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

linear_reg_model = LinearRegression()
linear_reg_model.fit(X_train, y_train)

linear_reg_preds = linear_reg_model.predict(X_test)

# Explicit square root of the MSE; `squared=False` is deprecated in newer scikit-learn releases
linear_reg_rmse = mean_squared_error(y_test, linear_reg_preds) ** 0.5
print("Linear Regression RMSE:", linear_reg_rmse)

Linear Regression RMSE: 45.34866317998951




### Multivariate State-Space Model:

1. **Dynamic and Flexible Modeling**: State-space models describe a time series through unobserved states that evolve over time, which gives a flexible framework for capturing relationships that change from one period to the next.

2. **Separation of Components**: State-space models can decompose a time series into components such as trend, seasonality, and noise, which gives insight into the underlying structure of the data. Modeling each component separately can also improve predictions.

3. **Incorporation of Exogenous Variables**: State-space models can include exogenous variables as regressors, so external factors that influence the series, such as the features in this dataset, enter the model directly.
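
To show what an unobserved component looks like in practice, here is a minimal Kalman filter for the simplest case, a "local level" model in which the observed value is a slowly drifting level plus noise. It is a sketch under assumed variances, with no exogenous regressors, and `local_level_filter` is a hypothetical helper rather than a `statsmodels` function. As far as I know, `UnobservedComponents` only includes such a component when asked to, for example via `level='local level'`.

```python
def local_level_filter(observations, obs_var=1.0, level_var=0.1,
                       init_level=0.0, init_var=1e6):
    """Kalman filter for y_t = level_t + noise, level_t = level_{t-1} + drift."""
    level, var = init_level, init_var
    filtered = []
    for y in observations:
        # Predict: the level carries over and its uncertainty grows by level_var
        var_pred = var + level_var
        # Update: blend prediction and observation using the Kalman gain
        gain = var_pred / (var_pred + obs_var)
        level = level + gain * (y - level)
        var = (1 - gain) * var_pred
        filtered.append(level)
    return filtered


# Hypothetical series: the level shifts from 5.0 to 8.0 halfway through
series = [5.0] * 20 + [8.0] * 20
filtered = local_level_filter(series)
print(filtered[19], filtered[20], filtered[-1])
```

The large `init_var` makes the first observation dominate the initial estimate; after the shift, the filtered level moves toward 8.0 gradually, at a rate set by the ratio of `level_var` to `obs_var`.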


In [22]:
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error

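# No level, trend, or seasonal component is requested here, so the fitted model appears to be,
# in effect, a regression of the target on the exogenous features plus a noise term
model_state_space = sm.tsa.UnobservedComponents(y_train, exog=X_train)
results_state_space = model_state_space.fit()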

state_space_preds = results_state_space.forecast(steps=len(X_test), exog=X_test)

# Explicit square root of the MSE; `squared=False` is deprecated in newer scikit-learn releases
state_space_rmse = mean_squared_error(y_test, state_space_preds) ** 0.5
print("Multivariate State-Space RMSE:", state_space_rmse)

  warn("Specified model does not contain a stochastic element;"


Multivariate State-Space RMSE: 44.8530754718732




## Conclusion:

After evaluating the six models above and comparing their RMSE on the test set, I conclude that the Multivariate State-Space Model and Multiple Linear Regression give the best results of the models tried.

### Performance Comparison:

- **Random Forest RMSE:** 90.04
- **XGBoost RMSE:** 84.35
- **LightGBM RMSE:** 83.91
- **Vector Auto Regression RMSE:** 52.22
- **Linear Regression RMSE:** 45.35 (Execution Time: 0.0 sec)
- **Multivariate State-Space RMSE:** 44.85 (Execution Time: 2 min 5 sec)
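
The figures above can be ranked programmatically. The sketch below uses the rounded RMSE values from this list; the dictionary name `rmse_by_model` is just an illustrative choice.

```python
# RMSE values copied from the comparison list (rounded to two decimals)
rmse_by_model = {
    "Random Forest": 90.04,
    "XGBoost": 84.35,
    "LightGBM": 83.91,
    "Vector Auto Regression": 52.22,
    "Linear Regression": 45.35,
    "Multivariate State-Space": 44.85,
}

# Sort ascending: lower RMSE is better
ranked = sorted(rmse_by_model.items(), key=lambda item: item[1])
best_name, best_rmse = ranked[0]

print(f"{'Model':<28}{'RMSE':>8}{'Gap':>8}")
for name, value in ranked:
    print(f"{name:<28}{value:>8.2f}{value - best_rmse:>+8.2f}")
```

The "Gap" column shows how far each model is from the best one, which makes it easy to see that the top two are separated by about 0.5 while the tree ensembles trail by roughly 39 to 45.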

I also observed that Linear Regression is much faster than the Multivariate State-Space model: its execution time was reported as 0.0 seconds, against 2 minutes 5 seconds for the state-space model.
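
The notebook does not show how these execution times were measured. One common way, sketched here with a hypothetical helper named `timed`, is to wrap the call in `time.perf_counter()`, which has enough resolution to distinguish a very fast fit from zero.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the notebook this could be, for example,
# timed(linear_reg_model.fit, X_train, y_train)
total, seconds = timed(sum, range(1_000_000))
print(f"result={total}, elapsed={seconds:.4f} s")
```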

Based on the RMSE values, Linear Regression and the Multivariate State-Space model clearly outperform the other models. The state-space model has the lower RMSE, but only by about 0.5 on a 96-row test set, while Linear Regression is far faster to fit.

These findings suggest that, for this particular dataset, either model is a suitable choice, and the decision comes down to the trade-off between speed and a small gain in accuracy.
