# **Walmart's Inventory Management and Demand Forecasting**

*Our project aims to develop an accurate sales forecasting model for various Walmart stores, leveraging factors such as date, store type, promotions, and environmental information. The model's output will be used to optimize inventory levels, preventing the overstock or shortages that disrupt product availability, and thereby improving store operational efficiency and customer satisfaction.*

*This notebook will specifically center on predictive data modeling using regression methods.*

## **I. Introduction**

### **`Members:`**
- Faris Arief Mawardi
- Michael Nathaniel
- Nadia Nabilla Shafira
- Noufal Rifata Reyhan

### **`Objective:`**

Build an accurate sales forecasting model for various Walmart stores by leveraging factors such as date, store type, promotions, and environmental information. The forecasts will then be used to optimize inventory levels, preventing excess or insufficient stock that could disrupt product availability, and thereby improving store operational efficiency and customer satisfaction.

### **`Project Workflow`**

1. **Introduction:**
    </br>Import the libraries needed in this project.

2. **Data Loading:**
    </br>Load the CSV file using Pandas DataFrame.

3. **Feature Engineering:**
    </br>Process and transform the data before the modeling phase to improve the quality of the input variables and, in turn, the performance of the machine learning models.

4. **Model Definition:**
    </br>Define some fundamental models that serve as benchmarks for assessing the performance of more sophisticated approaches.

5. **Model Training:**
    </br>Train the model on the training data using the chosen optimization algorithm.
    </br>Monitor the training process and adjust hyperparameters as needed.

6. **Model Evaluation:**
    </br>Evaluate the model's performance on a separate test dataset to assess its generalization ability. Because this is a regression problem, Mean Absolute Error (MAE) and R-squared are used as the evaluation metrics.

7. **Conclusion:**
    </br>Craft a comprehensive conclusion summarizing the model's findings and insights, providing a thoughtful reflection on its performance and potential implications for future applications or improvements.

---
# **II. Import Libraries**

In [1]:
import pandas as pd
# Feature Engineering
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from feature_engine.outliers import Winsorizer
from sklearn.preprocessing import RobustScaler, OneHotEncoder, MinMaxScaler, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

# Data Modeling
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor

# Model Evaluation
from sklearn.metrics import make_scorer, mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_score, KFold
from sklearn.model_selection import RandomizedSearchCV

# Model Saving
import pickle

# Others
import warnings
warnings.filterwarnings('ignore')

---
# **III. Data Loading**

> *This section details the process of loading the dataset that will be utilized in this project.*

In [2]:
# Loading the csv file
df = pd.read_csv(r"C:\BootcampHacktiv8\Phase2\FTDS-009-HCK-group-003\preprocessed_csv\Post-EDA.csv") # Read the CSV from the local file directory (raw string keeps the backslashes literal)
df.head()

| | Store | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | IsHoliday | Type | Size | Dept | Weekly_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 59.11 | 3.297 | 10382.9 | 6115.67 | 215.07 | 2406.62 | 6551.42 | 217.998085 | 7.866 | False | A | 151315 | 1 | 18689.54 |
| 1 | 1 | 59.11 | 3.297 | 10382.9 | 6115.67 | 215.07 | 2406.62 | 6551.42 | 217.998085 | 7.866 | False | A | 151315 | 2 | 44936.47 |
| 2 | 1 | 59.11 | 3.297 | 10382.9 | 6115.67 | 215.07 | 2406.62 | 6551.42 | 217.998085 | 7.866 | False | A | 151315 | 3 | 9959.64 |
| 3 | 1 | 59.11 | 3.297 | 10382.9 | 6115.67 | 215.07 | 2406.62 | 6551.42 | 217.998085 | 7.866 | False | A | 151315 | 4 | 36826.52 |
| 4 | 1 | 59.11 | 3.297 | 10382.9 | 6115.67 | 215.07 | 2406.62 | 6551.42 | 217.998085 | 7.866 | False | A | 151315 | 5 | 31002.65 |


---
# IV. Feature Engineering
> *In this Feature Engineering stage, two main processes will be carried out, namely Feature Selection and Feature Transformation, with the goal of enhancing the quality of the regression model to be constructed.*

## IV.1. Splitting Features and Target
> *This stage involves the process of separating features from the target. In the context of this dataset, features refer to all data columns except the `Weekly_Sales` column. This is because the `Weekly_Sales` column serves as the target for this analysis.*

In [3]:
X = df.drop(['Weekly_Sales'], axis = 1)    # The features (X) consist of all columns except the 'Weekly_Sales' column.
y = df['Weekly_Sales']                     # The target (y) in this project is 'Weekly_Sales'.

## IV.2. Feature Selection
> In the Feature Selection stage, a selection will be made among the features or variables to be included in the model. The main objective is to choose the most relevant and significant features in predicting the target (in this case, `Weekly_Sales`). This process can be carried out through various methods, including personal judgment, domain and business knowledge, inter-feature correlations, and VIF (Variance Inflation Factor), especially for linear regression models.

### IV.2.1. Based on Domain Knowledge
> *In this step, feature selection will be carried out by leveraging domain knowledge to identify pertinent features. The feature `Unemployment` is deemed unnecessary for constructing a predictive model for `Weekly_Sales` due to its limited impact on consumer spending patterns in the retail domain.*

In [4]:
X = X.drop(['Unemployment'], axis=1)
print(f"Dimension:", X.shape)

Dimension: (97056, 13)


### IV.2.2. Based on Correlation
> *In this step, features whose correlation with the target (`Weekly_Sales`) is lower than 0.1 are considered for removal. The `IsHoliday` feature will be removed because its correlation with the target was low in the correlation matrix from the previous Exploratory Data Analysis (EDA) stage.*

In [5]:
X = X.drop(['IsHoliday'], axis=1)
print(f"Dimension:", X.shape)

Dimension: (97056, 12)


### IV.2.3. Based on VIF
> *In this step, the Variance Inflation Factor (VIF) values will be calculated to check the linear regression assumption of no multicollinearity, that is, that no feature is close to a linear combination of the other features. VIF values can also serve as one of the criteria in the feature selection process.*

In [6]:
# define numeric columns
XNum = X[['Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5', 'CPI', 'Size']]

# create a function to calculate VIF
def calc_vif(XNum):

    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = XNum.columns
    vif["VIF"] = [variance_inflation_factor(XNum.values, i) for i in range(XNum.shape[1])]

    return(vif)

calc_vif(XNum)

| | variables | VIF |
|---|---|---|
| 0 | Temperature | 14.081955 |
| 1 | Fuel_Price | 26.97415 |
| 2 | MarkDown1 | 6.468707 |
| 3 | MarkDown2 | 1.283929 |
| 4 | MarkDown3 | 1.065445 |
| 5 | MarkDown4 | 4.32516 |
| 6 | MarkDown5 | 1.772228 |
| 7 | CPI | 16.860154 |
| 8 | Size | 11.159717 |


According to the provided output, it is observed that certain features, including `Temperature`, `Fuel_Price`, `CPI`, and `Size`, exhibit VIF values exceeding 10. This indicates the presence of multicollinearity issues. Consequently, a linear model may not be well-suited for accurately representing this dataset.
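
To make the VIF figures easier to interpret, here is a minimal NumPy sketch of the definition, VIF = 1 / (1 - R²), where R² comes from regressing one feature on all the others. It is not the code used above: the helper name `vif_with_intercept` and the synthetic columns are assumptions for illustration, and unlike a direct call to `variance_inflation_factor` on the raw feature matrix, this version adds an intercept to each auxiliary regression, so its values are not directly comparable to the table.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF per column: regress each column on the others (plus an intercept)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for i in range(n_cols):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(n_rows), others])   # intercept + other features
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = residual @ residual
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic demo (hypothetical data): column c is almost a copy of column a.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
demo_vif = vif_with_intercept(np.column_stack([a, b, c]))
print(demo_vif)
```

In the demo, the two nearly collinear columns get a large VIF while the independent column stays close to 1, which is the pattern the 10-threshold rule of thumb is meant to detect.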

## IV.3. Splitting Training Data and Test Data
> *In this stage, the dataset will be split into two main parts: the training data and the testing data. The purpose of this division is to train the model using the training data and subsequently evaluate its performance on the testing data.*

In [7]:
XTrain, XTest, yTrain, yTest = train_test_split(X, y, test_size = 0.3, random_state = 10)
print('Size X Train:', XTrain.shape)
print('Size X Test:', XTest.shape)

Size X Train: (67939, 12)
Size X Test: (29117, 12)


With a training-to-testing data ratio of 70:30, the training dataset comprises 67,939 entries, while the testing dataset consists of 29,117 entries.

## IV.4. Cardinality Analysis
> *Cardinality analysis is conducted to evaluate how many unique values exist in each categorical column in the dataset. Cardinality measures the diversity or variation of values within a column. Columns with low cardinality have a limited number of unique values, while those with high cardinality have many unique values or labels.*
> - _**Low Cardinality**: Columns with low cardinality often do not require special treatment. Machine learning models generally handle these features well without additional modifications._
> - _**High Cardinality**: Columns with high cardinality may require special attention. Common strategies involve value grouping or the use of techniques such as embedding to address the complexity that can arise from a large number of unique values._

In [8]:
print('Number of categories in the variable Store       : {}'.format(XTrain['Store'].nunique()))
print('Number of categories in the variable Type        : {}'.format(XTrain['Type'].nunique()))
print('Number of categories in the variable Dept        : {}'.format(XTrain['Dept'].nunique()))
print('Total number of data                             : {}'.format(len(XTrain)))

Number of categories in the variable Store       : 45
Number of categories in the variable Type        : 3
Number of categories in the variable Dept        : 81
Total number of data                             : 67939


Apart from `Store` and `Dept`, the remaining categorical feature, `Type`, has only three unique values, so it raises no cardinality issue. `Store` and `Dept` have many unique values, but they are left untreated because they serve as identifiers for a store and a department, respectively.

## IV.5. Feature Transformation
> *During the feature transformation stage, several crucial tasks will be executed, encompassing outlier handling, scaling for numerical features, and encoding for categorical features.*

In [9]:
# Numerical Columns with Outliers at the Lower Bound
numCol1 = ['Temperature']
num1_pipeline = make_pipeline(Winsorizer(capping_method='iqr', tail='left', fold=1.5),    # capping extreme values using Winsorizer()
                             RobustScaler())                                              # scaling numerical features using RobustScaler()

# Numerical Columns with Outliers at the Upper Bound
numCol2 = ['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5']
num2_pipeline = make_pipeline(Winsorizer(capping_method='iqr', tail='right', fold=3),     # capping extreme values using Winsorizer()
                             RobustScaler())                                              # scaling numerical features using RobustScaler()

# Numerical Columns without Outliers
numCol3 = ['Fuel_Price', 'CPI', 'Size']
num3_pipeline = make_pipeline(MinMaxScaler())                                             # scaling numerical features using MinMaxScaler()

# Nominal Columns
nomCol = ['Type']
nom_pipeline = make_pipeline(OneHotEncoder())                                             # encoding categorical features using OneHotEncoder()


prep = ColumnTransformer([
    ('numerik: Temperature', num1_pipeline, numCol1),
    ('numerik: MarkDown', num2_pipeline, numCol2),
    ('numerik: Fuel_Price, CPI, Size', num3_pipeline, numCol3),
    ('nominal', nom_pipeline, nomCol)],
    remainder='passthrough')
prep

In [10]:
XTrain_transformed = prep.fit_transform(XTrain)
XTest_transformed = prep.transform(XTest)
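
The IQR capping rule that the `Winsorizer` steps rely on can be sketched in plain NumPy. This is an illustrative approximation under the assumption that the bounds are Q1 - fold × IQR and Q3 + fold × IQR; the helper `iqr_cap` and the sample values are hypothetical and not part of the pipeline above.

```python
import numpy as np

def iqr_cap(values, tail="right", fold=1.5):
    """Cap values outside the IQR-based bounds on the chosen tail(s)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - fold * iqr if tail in ("left", "both") else -np.inf
    upper = q3 + fold * iqr if tail in ("right", "both") else np.inf
    return np.clip(values, lower, upper)

# Hypothetical markdown amounts with one extreme value on the right tail.
markdown_demo = np.array([0.0, 10.0, 20.0, 30.0, 40.0, 1000.0])
capped = iqr_cap(markdown_demo, tail="right", fold=3)
print(capped)  # the last value is pulled down to Q3 + 3 * IQR = 112.5
```

A larger `fold` (3 for the `MarkDown` columns versus 1.5 for `Temperature`) widens the bounds, so only the most extreme values are capped.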

### Polynomial Features Transformation

In [11]:
poly = PolynomialFeatures(degree=2)

XTrain_poly = poly.fit_transform(XTrain_transformed)
XTest_poly = poly.transform(XTest_transformed)
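
A degree-2 expansion grows the feature count quickly. The sketch below counts the resulting terms with the standard formula C(n + d, d), which includes the bias column. The input width of 14 is an assumption derived from the steps above: 1 `Temperature` column, 5 `MarkDown` columns, 3 min-max scaled columns, 3 one-hot columns for `Type`, and 2 passthrough columns (`Store`, `Dept`). The helper name `n_poly_terms` is hypothetical.

```python
from math import comb

def n_poly_terms(n_features, degree=2, include_bias=True):
    """Number of polynomial terms of total degree <= `degree`."""
    total = comb(n_features + degree, degree)
    return total if include_bias else total - 1

# Assumed width of the transformed matrix: 1 + 5 + 3 + 3 + 2 columns.
n_transformed = 1 + 5 + 3 + 3 + 2
n_terms = n_poly_terms(n_transformed, degree=2)
print(n_transformed, "->", n_terms)  # 14 -> 120
```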

---
# V. Model Definition
> *In this stage, several baseline models will be created using various machine learning algorithms. The goal of this step is to test the initial performance of the selected algorithms, with the hope of identifying algorithms that perform well at predicting `Weekly_Sales`.*

In [12]:
# Polynomial Regression (a linear model fitted on the degree-2 features in XTrain_poly)
pr = make_pipeline(LinearRegression())

# KNN
knn = make_pipeline(prep, KNeighborsRegressor())

# SVM
svm = make_pipeline(prep, SVR())

# Decision Tree
dt = make_pipeline(prep, DecisionTreeRegressor(random_state=10))

# Random Forest
rf = make_pipeline(prep, RandomForestRegressor(random_state=10))

# AdaBoost
ab = make_pipeline(prep, AdaBoostRegressor(random_state=10))
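
One possible variant, not the approach used in this notebook, is to place the preprocessing and the polynomial expansion inside the same pipeline as the linear model, so that cross-validation refits every step on each training fold. The sketch below uses a small synthetic frame; `toy_X`, `toy_y`, `toy_prep`, and `pr_full` are hypothetical names, and only three illustrative columns are included.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, PolynomialFeatures

# Synthetic stand-in data (not the Walmart dataset).
rng = np.random.default_rng(0)
n_rows = 200
toy_X = pd.DataFrame({
    "Size": rng.uniform(30_000, 200_000, n_rows),
    "CPI": rng.uniform(120, 230, n_rows),
    "Type": rng.choice(["A", "B", "C"], n_rows),
})
toy_y = 0.1 * toy_X["Size"] + 5 * toy_X["CPI"] + rng.normal(0, 100, n_rows)

toy_prep = ColumnTransformer([
    ("num", MinMaxScaler(), ["Size", "CPI"]),
    ("nom", OneHotEncoder(handle_unknown="ignore"), ["Type"]),
])

# Preprocessing, polynomial expansion, and regression in one estimator.
pr_full = make_pipeline(toy_prep, PolynomialFeatures(degree=2), LinearRegression())

kfold_demo = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pr_full, toy_X, toy_y, cv=kfold_demo,
                         scoring="neg_mean_absolute_error")
print("Mean MAE:", -scores.mean())
```

With this layout the raw `XTrain` can be passed to every model, in the same way as the other pipelines defined above.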

---
# VI. Model Training
> *In the Model Training phase, the primary goal is to use the prepared dataset to train a machine learning model. This involves feeding the model with the training data and enabling it to learn the patterns, relationships, and structures within the data.*

In [13]:
# Polynomial Regression
pr.fit(XTrain_poly, yTrain)

In [14]:
# KNN
knn.fit(XTrain, yTrain)

In [15]:
# SVM
svm.fit(XTrain, yTrain)

In [16]:
# Decision Tree
dt.fit(XTrain, yTrain)

In [17]:
# Random Forest
rf.fit(XTrain, yTrain)

In [None]:
# AdaBoost
ab.fit(XTrain, yTrain)

---
# VII. Model Evaluation
> *The Model Evaluation phase is dedicated to thoroughly assessing the performance of the trained machine learning model. The goal is to ensure that the model generalizes well to new, unseen data and to provide insights into its strengths and potential shortcomings.*

## VII.1. Cross-Validation
> *Cross-validation is a robust technique used to assess the performance and generalization ability of a machine learning model. It helps ensure that the model's evaluation is not overly dependent on a particular train-test split.*

### VII.1.1. Based on Mean Absolute Error Score
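
As a rough illustration of what a shuffled 5-fold split does, the NumPy sketch below partitions row indices into five validation folds and uses the remaining rows for training each time. The generator `kfold_indices` is a hypothetical helper, not scikit-learn's `KFold`, and its shuffling will not reproduce the same folds.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs for a shuffled k-fold split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, n_splits)
    for k in range(n_splits):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        yield train_idx, val_idx

# 67,939 is the number of training rows reported earlier in this notebook.
splits = list(kfold_indices(67939, n_splits=5))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

Every row is used for validation exactly once, which is why the spread of the five scores says something about how sensitive a model is to the particular split.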

In [None]:
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Define regression-specific scoring functions
custom_mae_scorer = make_scorer(mean_absolute_error)

# Checking cross-validation score
cv_pr_model = cross_val_score(pr, XTrain_poly, yTrain, cv=kfold, scoring=custom_mae_scorer)
cv_knn_model = cross_val_score(knn, XTrain, yTrain, cv=kfold, scoring=custom_mae_scorer)
cv_svm_model = cross_val_score(svm, XTrain, yTrain, cv=kfold, scoring=custom_mae_scorer)
cv_dt_model = cross_val_score(dt, XTrain, yTrain, cv=kfold, scoring=custom_mae_scorer)
cv_rf_model = cross_val_score(rf, XTrain, yTrain, cv=kfold, scoring=custom_mae_scorer)
cv_ab_model = cross_val_score(ab, XTrain, yTrain, cv=kfold, scoring=custom_mae_scorer)

cv_scores = float('inf')
name_model = ""

for cv, name in zip(
    [cv_pr_model, cv_knn_model, cv_svm_model, cv_dt_model, cv_rf_model, cv_ab_model],
    ["pr", "knn", "svm", "dt", "rf", "ab"],
):
    print(name)
    print("MAE - All - Cross Validation  : ", cv)
    print("MAE - Mean - Cross Validation : ", cv.mean())
    print("MAE - Std - Cross Validation  : ", cv.std())
    print("MAE - Range of Test-Set       : ", (cv.mean() - cv.std()), "-", (cv.mean() + cv.std()))
    print("-" * 50)

    if cv.mean() < cv_scores:
        cv_scores = cv.mean()
        name_model = name
    else:
        pass

print("Best model:", name_model)
print("Cross-val mean MAE:", cv_scores)

pr
MAE - All - Cross Validation  :  [13845.95256275 13745.34954425 13993.02521637 14375.93496268
 13911.76843881]
MAE - Mean - Cross Validation :  13974.406144973891
MAE - Std - Cross Validation  :  216.54689682375954
MAE - Range of Test-Set       :  13757.859248150131 - 14190.953041797651
--------------------------------------------------
knn
MAE - All - Cross Validation  :  [7696.95263188 7594.71429614 7563.61255785 7915.64328893 7616.32143578]
MAE - Mean - Cross Validation :  7677.4488421171
MAE - Std - Cross Validation  :  127.00965067867355
MAE - Range of Test-Set       :  7550.439191438427 - 7804.458492795773
--------------------------------------------------
svm
MAE - All - Cross Validation  :  [14170.90541451 13953.43600856 14221.96030383 14776.13890544
 14144.70686242]
MAE - Mean - Cross Validation :  14253.429498952999
MAE - Std - Cross Validation  :  276.7100881256323
MAE - Range of Test-Set       :  13976.719410827367 - 14530.139587078631
-----------------------------------

Based on the cross-validation output, the **_random forest_** model has the lowest mean MAE, at 2520.77. MAE is the average absolute difference between the model's predictions and the actual `Weekly_Sales` values, so on average the random forest's predictions deviate from the true `Weekly_Sales` by about 2520.77 units.
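
For reference, MAE can be computed by hand as the mean of the absolute errors. In the sketch below the actual values are the first three `Weekly_Sales` rows shown in the data preview, while the predictions are made-up numbers used only for illustration; `mean_absolute_error_np` is a hypothetical helper.

```python
import numpy as np

def mean_absolute_error_np(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

toy_actual = np.array([18689.54, 44936.47, 9959.64])      # from the data preview
toy_predicted = np.array([20000.00, 42000.00, 10000.00])  # hypothetical predictions
toy_mae = mean_absolute_error_np(toy_actual, toy_predicted)
print(toy_mae)  # (1310.46 + 2936.47 + 40.36) / 3
```

Because MAE is expressed in the same units as `Weekly_Sales`, it can be compared directly against typical weekly sales figures.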

### VII.1.2. Based on R-Squared Score

In [None]:
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Define regression-specific scoring functions
custom_r2_scorer = make_scorer(r2_score)

# R-squared scores
cv_r2_pr_model = cross_val_score(pr, XTrain_poly, yTrain, cv=kfold, scoring=custom_r2_scorer)
cv_r2_knn_model = cross_val_score(knn, XTrain, yTrain, cv=kfold, scoring=custom_r2_scorer)
cv_r2_svm_model = cross_val_score(svm, XTrain, yTrain, cv=kfold, scoring=custom_r2_scorer)
cv_r2_dt_model = cross_val_score(dt, XTrain, yTrain, cv=kfold, scoring=custom_r2_scorer)
cv_r2_rf_model = cross_val_score(rf, XTrain, yTrain, cv=kfold, scoring=custom_r2_scorer)
cv_r2_ab_model = cross_val_score(ab, XTrain, yTrain, cv=kfold, scoring=custom_r2_scorer)

name_r2_model = ""
cv_r2_scores = float('-inf')    # R2 can be negative, so start below any possible score

for cv_r2, name in zip(
    [cv_r2_pr_model, cv_r2_knn_model, cv_r2_svm_model, cv_r2_dt_model, cv_r2_rf_model, cv_r2_ab_model],
    ["pr", "knn", "svm", "dt", "rf", "ab"],
):
    print(name)
    print("R2 - All - Cross Validation  : ", cv_r2)
    print("R2 - Mean - Cross Validation : ", cv_r2.mean())
    print("R2 - Std - Cross Validation  : ", cv_r2.std())
    print("R2 - Range of Test-Set       : ", (cv_r2.mean() - cv_r2.std()), "-", (cv_r2.mean() + cv_r2.std()))
    print("-" * 50)

    if cv_r2.mean() > cv_r2_scores:
        cv_r2_scores = cv_r2.mean()
        name_r2_model = name
    else:
        pass

print("Best model based on R2:", name_r2_model)
print("Cross-val mean R2:", cv_r2_scores)

pr
R2 - All - Cross Validation  :  [0.21084067 0.21203418 0.20922793 0.1979627  0.20932789]
R2 - Mean - Cross Validation :  0.20787867489400505
R2 - Std - Cross Validation  :  0.005065351008562631
R2 - Range of Test-Set       :  0.20281332388544243 - 0.21294402590256767
--------------------------------------------------
knn
R2 - All - Cross Validation  :  [0.64384327 0.65051975 0.66873696 0.65171956 0.65602116]
R2 - Mean - Cross Validation :  0.6541681370711803
R2 - Std - Cross Validation  :  0.008265511915583359
R2 - Range of Test-Set       :  0.6459026251555969 - 0.6624336489867636
--------------------------------------------------
svm
R2 - All - Cross Validation  :  [-0.10414067 -0.10244873 -0.10620806 -0.11673368 -0.10327512]
R2 - Mean - Cross Validation :  -0.10656125081483885
R2 - Std - Cross Validation  :  0.005237817981644285
R2 - Range of Test-Set       :  -0.11179906879648313 - -0.10132343283319456
--------------------------------------------------
dt
R2 - All - Cross Validat

Based on the cross-validation results, the **_random forest_** model explains nearly 94% of the variance in `Weekly_Sales`. Because this R2 is computed on held-out folds rather than on the data each model was fitted on, it suggests that the features capture most of the variation in weekly sales and that the model is likely to generalize to new, unseen data.
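
The negative R2 reported for the SVM model above can look surprising. The sketch below computes R2 from its definition, 1 - SS_res / SS_tot, on made-up numbers to show that a model that does worse than always predicting the mean gets a score below zero; `r2_score_np` is a hypothetical helper.

```python
import numpy as np

def r2_score_np(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

toy_actual = np.array([10.0, 20.0, 30.0, 40.0])                            # hypothetical values
r2_perfect = r2_score_np(toy_actual, toy_actual)                           # 1.0
r2_mean_baseline = r2_score_np(toy_actual, np.full(4, toy_actual.mean()))  # 0.0
r2_bad_model = r2_score_np(toy_actual, toy_actual[::-1])                   # -3.0
print(r2_perfect, r2_mean_baseline, r2_bad_model)
```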

In [None]:
with open('rf.pkl', 'wb') as base_model:
  pickle.dump(rf, base_model)

### VII.2.1. Random Forest Model Improvement
> *In this step, the focus is on enhancing the performance of the random forest model through hyperparameter tuning. Hyperparameters are external configuration settings for a model that can significantly impact its performance. The goal is to systematically search through different combinations of hyperparameters to find the set that optimizes the model's performance.*

In [18]:
# Define parameters for Random Forest
params_rf = {
    'randomforestregressor__n_estimators': [50, 100, 200],
    'randomforestregressor__criterion': ['absolute_error'],
    'randomforestregressor__max_depth': [None, 10, 30, 50],
    'randomforestregressor__min_samples_split': [2, 5, 10],
    'randomforestregressor__min_samples_leaf': [1, 2, 4],
    'randomforestregressor__max_features': ['sqrt', 'log2', None]
}

# Scoring
scoring = {
    'MAE': make_scorer(mean_absolute_error, greater_is_better=False),
    'R2': make_scorer(r2_score)
}

# Create a RandomizedSearchCV instance for Random Forest
random_rf = RandomizedSearchCV(estimator=rf, param_distributions=params_rf, cv=3, n_jobs=-1, verbose=10, n_iter=10, scoring=scoring, refit='R2')
random_rf.fit(XTrain, yTrain)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


In [19]:
rf_best = random_rf.best_estimator_
rf_best

Based on the above output, the best parameters found for the RandomForestRegressor model during hyperparameter tuning are as follows:

- **Criterion**: 'absolute_error'
- **Max Depth**: 30
- **Max Features**: None
- **Min Samples Leaf**: 2
- **Min Samples Split**: 5
- **Random State**: 10

These hyperparameters come from the best of the 10 candidate configurations that `RandomizedSearchCV` sampled from the hyperparameter space and scored with 3-fold cross-validation, using R2 as the refit criterion. The search is therefore not exhaustive, and a better configuration may exist among the combinations that were not tried.
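
To put the size of the search in perspective, the sketch below enumerates the full random forest grid defined above and draws 10 candidates from it without replacement. It uses only the standard library and imitates the idea of a randomized search; it is not scikit-learn's internal sampler, so the drawn candidates will differ from those actually evaluated.

```python
import itertools
import random

# Same value lists as params_rf above, with the pipeline step prefix dropped.
param_grid = {
    "n_estimators": [50, 100, 200],
    "criterion": ["absolute_error"],
    "max_depth": [None, 10, 30, 50],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2", None],
}

# Every combination in the grid.
all_candidates = [
    dict(zip(param_grid, combo))
    for combo in itertools.product(*param_grid.values())
]

# Draw 10 distinct candidates, mirroring n_iter=10.
sampler = random.Random(10)
sampled = sampler.sample(all_candidates, k=10)
coverage = len(sampled) / len(all_candidates)
print(len(all_candidates), len(sampled), round(coverage, 3))  # 324 10 0.031
```

Only about 3% of the grid is evaluated, which keeps the run time manageable at the cost of possibly missing the best combination.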

### VII.2.2. Decision Tree Model Improvement
> *In this step, the focus is on enhancing the performance of the Decision Tree model through hyperparameter tuning. Hyperparameters are external configuration settings for a model that can significantly impact its performance. The goal is to systematically search through different combinations of hyperparameters to find the set that optimizes the model's performance.*

In [26]:
# Define parameters for Decision Tree
params_dt = {
    'decisiontreeregressor__criterion': ['squared_error', 'absolute_error'],
    'decisiontreeregressor__splitter': ['best', 'random'],
    'decisiontreeregressor__max_depth': [None, 10, 30, 50],
    'decisiontreeregressor__min_samples_split': [2, 5, 10],
    'decisiontreeregressor__min_samples_leaf': [1, 2, 4],
    'decisiontreeregressor__max_features': ['sqrt', 'log2', None]    # 'auto' was removed in scikit-learn 1.3; for regressors it was equivalent to None
}

# Scoring
scoring = {
    'MAE': make_scorer(mean_absolute_error, greater_is_better=False),
    'R2': make_scorer(r2_score)
}

# Create a RandomizedSearchCV instance for Decision Tree
random_dt = RandomizedSearchCV(estimator=dt, param_distributions=params_dt, cv=3, n_jobs=-1, verbose=10, n_iter=10, scoring=scoring, refit='R2')
random_dt.fit(XTrain, yTrain)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


In [27]:
dt_best = random_dt.best_estimator_
dt_best

Based on the above output, the best parameters found for the Decision Tree model during hyperparameter tuning are as follows:
- **Criterion**: 'absolute_error'
- **Max Depth**: 10
- **Min Samples Leaf**: 4
- **Min Samples Split**: 5
- **Random State**: 10
- **Splitter**: 'random'

These hyperparameters come from the best of the 10 candidate configurations that `RandomizedSearchCV` sampled from the hyperparameter space and scored with 3-fold cross-validation, using R2 as the refit criterion. Using them in the Decision Tree model is expected to improve its predictive performance relative to the default configuration.

### VII.3.1. Evaluation of Optimized Decision Tree Model
> *In this stage, the emphasis is on assessing the performance of the fine-tuned decision tree model. The model underwent a process of hyperparameter tuning to enhance its configuration for more accurate predictions. The evaluation aims to gauge how well the optimized decision tree model generalizes to new, unseen data.*

In [28]:
# Evaluate the tuned model
yPredTrain_random_dt = dt_best.predict(XTrain)
yPredTest_random_dt = dt_best.predict(XTest)

# Compute R-squared on train and test
r2_train_random_dt = r2_score(yTrain, yPredTrain_random_dt)
r2_test_random_dt = r2_score(yTest, yPredTest_random_dt)

# MAE
mae_train_random_dt = mean_absolute_error(yTrain, yPredTrain_random_dt)
mae_test_random_dt = mean_absolute_error(yTest, yPredTest_random_dt)

# Display the evaluation results
print('MAE - Train Data, Tuned Decision Tree Regressor Model :', mae_train_random_dt)
print('MAE - Test Data, Tuned Decision Tree Regressor Model :', mae_test_random_dt)
print('-'*90)
print('R-squared - Train Data, Tuned Decision Tree Regressor Model :', r2_train_random_dt)
print('R-squared - Test Data, Tuned Decision Tree Regressor Model :', r2_test_random_dt)

MAE - Train Data, Tuned Decision Tree Regressor Model : 4602.432048892809
MAE - Test Data, Tuned Decision Tree Regressor Model : 4955.721331530533
------------------------------------------------------------------------------------------
R-squared - Train Data, Tuned Decision Tree Regressor Model : 0.8927622889299904
R-squared - Test Data, Tuned Decision Tree Regressor Model : 0.8330277521302689


### VII.3.2. Evaluation of Optimized Random Forest Model
> *In this stage, the emphasis is on assessing the performance of the fine-tuned random forest model. The model underwent a process of hyperparameter tuning to enhance its configuration for more accurate predictions. The evaluation aims to gauge how well the optimized random forest model generalizes to new, unseen data.*

In [29]:
# Evaluate the tuned model
yPredTrain_random_rf = rf_best.predict(XTrain)
yPredTest_random_rf = rf_best.predict(XTest)

# Compute R-squared on train and test
r2_train_random_rf = r2_score(yTrain, yPredTrain_random_rf)
r2_test_random_rf = r2_score(yTest, yPredTest_random_rf)

# MAE
mae_train_random_rf = mean_absolute_error(yTrain, yPredTrain_random_rf)
mae_test_random_rf = mean_absolute_error(yTest, yPredTest_random_rf)

# Display the evaluation results
print('MAE - Train Data, Tuned Random Forest Regressor Model :', mae_train_random_rf)
print('MAE - Test Data, Tuned Random Forest Regressor Model :', mae_test_random_rf)
print('-'*90)
print('R-squared - Train Data, Tuned Random Forest Regressor Model :', r2_train_random_rf)
print('R-squared - Test Data, Tuned Random Forest Regressor Model :', r2_test_random_rf)

MAE - Train Data, Tuned Random Forest Regressor Model : 1406.8024259512213
MAE - Test Data, Tuned Random Forest Regressor Model : 2488.9255280454718
------------------------------------------------------------------------------------------
R-squared - Train Data, Tuned Random Forest Regressor Model : 0.9774356672223008
R-squared - Test Data, Tuned Random Forest Regressor Model : 0.9389155282535251
R-squared - Data Test Hasil Tuning Random Forest Regressor Model : 0.9389155282535251


In [30]:
# Calculate deviation for MAE
mae_deviation_DT = abs(mae_train_random_dt - mae_test_random_dt)
mae_deviation_RF = abs(mae_train_random_rf - mae_test_random_rf)

# Calculate deviation for R-squared
r2_deviation_DT = abs(r2_train_random_dt - r2_test_random_dt)
r2_deviation_RF = abs(r2_train_random_rf - r2_test_random_rf)

# Creating a DataFrame for comparison
data = {
    'Model': ['Decision Tree', 'Random Forest'],
    'MAE Train': [mae_train_random_dt, mae_train_random_rf],
    'MAE Test': [mae_test_random_dt, mae_test_random_rf],
    'MAE Deviation': [mae_deviation_DT, mae_deviation_RF],
    'R-squared Train': [r2_train_random_dt, r2_train_random_rf],
    'R-squared Test': [r2_test_random_dt, r2_test_random_rf],
    'R2 Deviation': [r2_deviation_DT, r2_deviation_RF]
}

comparison_df = pd.DataFrame(data)
comparison_df


| | Model | MAE Train | MAE Test | MAE Deviation | R-squared Train | R-squared Test | R2 Deviation |
|---|---|---|---|---|---|---|---|
| 0 | Decision Tree | 4602.432049 | 4955.721332 | 353.289283 | 0.892762 | 0.833028 | 0.059735 |
| 1 | Random Forest | 1406.802426 | 2488.925528 | 1082.123102 | 0.977436 | 0.938916 | 0.03852 |


**Insights :**

The Random Forest Regressor outperforms the Decision Tree Regressor on both Mean Absolute Error (MAE) and R-squared, on the training data and the test data alike. Its train-test gap in MAE is larger (1082.12 versus 353.29), which points to some overfitting, but its test MAE (2488.93) is still about half that of the Decision Tree (4955.72). The Random Forest Regressor is therefore chosen as the final predictive model.
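
Absolute deviations are hard to compare across models whose error levels differ, so a relative view can help. The sketch below recomputes the gap as a test-to-train MAE ratio and an R-squared drop, using the figures from the comparison table; the dictionary layout and the names `metrics` and `gap_summary` are illustrative choices.

```python
# Values copied from the comparison table above.
metrics = {
    "Decision Tree": {"mae_train": 4602.432049, "mae_test": 4955.721332,
                      "r2_train": 0.892762, "r2_test": 0.833028},
    "Random Forest": {"mae_train": 1406.802426, "mae_test": 2488.925528,
                      "r2_train": 0.977436, "r2_test": 0.938916},
}

gap_summary = {}
for model_name, m in metrics.items():
    gap_summary[model_name] = {
        "mae_ratio": m["mae_test"] / m["mae_train"],   # 1.0 would mean no gap
        "r2_drop": m["r2_train"] - m["r2_test"],
    }

for model_name, gap in gap_summary.items():
    print(model_name, round(gap["mae_ratio"], 2), round(gap["r2_drop"], 4))
```

On this view the random forest's test MAE is roughly 1.77 times its training MAE, against about 1.08 for the decision tree, while its R-squared drop is the smaller of the two.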

---
# VIII. Model Saving
> *The optimized random forest, identified as the best model, will be saved for future use in inference tasks.*

In [31]:
with open('best_rf.pkl', 'wb') as model:
  pickle.dump(rf_best, model)
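
Loading the saved file back for inference follows the reverse pattern. The sketch below shows a pickle round trip with a stand-in dictionary in place of the fitted pipeline and a temporary directory in place of the project folder; both are assumptions made so that the example runs on its own.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the fitted pipeline; any picklable object behaves the same way.
stand_in_model = {"name": "rf_best", "max_depth": 30}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "best_rf.pkl"

    # Save, as done above.
    with open(model_path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Load, as an inference notebook would.
    with open(model_path, "rb") as f:
        restored_model = pickle.load(f)

print(restored_model == stand_in_model)
```

When the real pipeline is unpickled, the environment should have the same versions of scikit-learn and feature_engine that were used to save it.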

---
# IX. Model Inference
> *Model inference will be conducted in the DS_inference.ipynb file.*

---
# X. Conclusion

- On the training data, the tuned random forest model achieves an R-squared (R2) score of 97.74%, meaning that roughly 97.74% of the variance in the target variable is explained by the model's features. Its Mean Absolute Error (MAE) on the training data is 1406.8 units, so on average the model's predictions deviate from the actual values by about 1406.8 units. Because these figures are measured on the data the model was fitted on, they show how well the model captures patterns in the training data rather than how reliably it predicts new data.

- On the test data, the random forest model achieves an R2 score of 93.89%, meaning that approximately 93.89% of the variance in the target variable is explained by the model. This indicates that the model generalizes well to new, unseen data. The MAE on the test data is 2488.93 units, so on average the model's predictions deviate from the actual values in the test dataset by about 2488.93 units.

- After hyperparameter tuning, the random forest model shows a slight tendency to overfit: its R2 drops from 97.74% on the training data to 93.89% on the test data, and its MAE rises from 1406.8 to 2488.93 units. Overfitting occurs when a model fits the particulars of the training data so closely that it treats them as more general than they are. Here the degree of overfitting is considered acceptable, because the model's performance on new, unseen data remains satisfactory.