<center><h1><b>Sales Export Prediction using PySpark</b></h1></center>

##### ```Delux is an online retailer based in the UK that deals in a wide range of products. Each sale is recorded with the following fields:```
- **date:** sale date
- **order_value_EUR:** sale price in EUR
- **cost:** cost of goods sold in EUR
- **category:** item category
- **country:** customers' country at the time of purchase
- **customer_name:** name of customer
- **device_type:** the device used by the customer to access the online store (PC, mobile, tablet)
- **sales_manager:** name of the sales manager for each sale
- **sales_rep:** name of the sales rep for each sale
- **order_id:** unique identifier of an order

##### ```The data was recorded between 1/2/2019 and 12/30/2020 with the aim of generating insights to guide business direction. We would like to see what interesting insights the Kaggle community members can produce from this data.```

### **Imports**

In [15]:
import pandas as pd
import plotly.express as px
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer # Handling Categorical Features
from pyspark.ml.feature import VectorAssembler # Combines the independent feature columns into a single vector column
from pyspark.sql.functions import regexp_replace # Type casting and setting up order_value_EUR column
from pyspark import __version__ as pyspark_version
from pyspark.ml.regression import LinearRegression
from pyspark.sql.functions import col, sum as sql_sum # Checking Null values

In [16]:
print(pyspark_version)

3.5.2


### **Setting up Spark Session**
##### ```Creating a Spark session is necessary to use PySpark (Apache Spark) functionality```

In [3]:
# Creating Spark Session for our Task
spark = SparkSession.builder.appName('sales_export_analysis').getOrCreate()

### **Reading Dataset**

In [4]:
# Reading the CSV file using PySpark
### ---inferSchema=True infers a datatype for each column; without it every column is read as a string
### ---header=True uses the first row as column names; without it Spark assigns default names (_c0, _c1, ...) and treats the first row as data
df_pyspark = spark.read.csv('Dataset/Sales-Export-dataset.csv', header=True, inferSchema=True)

In [5]:
# Checking structure and details of our dataset
df_pyspark.printSchema()

root
 |-- country: string (nullable = true)
 |-- order_value_EUR: string (nullable = true)
 |-- cost: double (nullable = true)
 |-- date: string (nullable = true)
 |-- category: string (nullable = true)
 |-- customer_name: string (nullable = true)
 |-- sales_manager: string (nullable = true)
 |-- sales_rep: string (nullable = true)
 |-- device_type: string (nullable = true)
 |-- order_id: string (nullable = true)



In [6]:
df_pyspark.show()

+--------+---------------+---------+----------+-----------+--------------------+---------------+-------------------+-----------+----------+
| country|order_value_EUR|     cost|      date|   category|       customer_name|  sales_manager|          sales_rep|device_type|  order_id|
+--------+---------------+---------+----------+-----------+--------------------+---------------+-------------------+-----------+----------+
|  Sweden|      17,524.02| 14122.61| 2/12/2020|      Books|     Goldner-Dibbert|   Maxie Marrow|      Madelon Bront|     Mobile|70-0511466|
| Finland|     116,563.40| 92807.78| 9/26/2019|      Games|    Hilll-Vandervort|     Hube Corey|        Wat Bowkley|     Mobile|28-6585323|
|Portugal|     296,465.56|257480.34| 7/11/2019|   Clothing|      Larkin-Collier|Celine Tumasian| Smitty Culverhouse|         PC|58-7703341|
|Portugal|      74,532.02| 59752.32|  4/2/2020|     Beauty|   Hessel-Stiedemann|Celine Tumasian|       Aurelie Wren|         PC|14-6700183|
|   Spain|     178,7

### **Data Preprocessing PART # 1**

In [None]:
# Calculate the number of missing (NULL) values for each column in the PySpark DataFrame.
# This helps in identifying the extent of missing data in each column,
# which can be useful for data cleaning or preprocessing steps.
null_counts = df_pyspark.agg(*[sql_sum(col(c).isNull().cast("int")).alias(c) for c in df_pyspark.columns])

In [None]:
null_counts.show()

+-------+---------------+----+----+--------+-------------+-------------+---------+-----------+--------+
|country|order_value_EUR|cost|date|category|customer_name|sales_manager|sales_rep|device_type|order_id|
+-------+---------------+----+----+--------+-------------+-------------+---------+-----------+--------+
|      0|              0|   0|   0|       0|            0|            0|        0|          0|       0|
+-------+---------------+----+----+--------+-------------+-------------+---------+-----------+--------+
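
The aggregation above casts each `isNull()` flag to an integer and sums it per column. A minimal plain-Python sketch of the same idea, using a few made-up rows (the `rows` data and the `count_nulls` helper are illustrative and not part of the notebook):

```python
# Illustrative rows; None stands in for a SQL NULL.
rows = [
    {"country": "Sweden", "cost": 14122.61, "device_type": "Mobile"},
    {"country": None, "cost": 92807.78, "device_type": "Mobile"},
    {"country": "Portugal", "cost": None, "device_type": None},
]

def count_nulls(rows):
    """Return {column: number of None values}, mirroring sum(isNull cast to int)."""
    columns = rows[0].keys()
    return {c: sum(int(r[c] is None) for r in rows) for c in columns}

null_counts_sketch = count_nulls(rows)
print(null_counts_sketch)
```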



### **EDA**

In [17]:
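# Convert PySpark DataFrame to Pandas for visualization
df_pandas = df_pyspark.toPandas()
# order_value_EUR may still be a string such as "17,524.02"; make it numeric so sums and histograms are correct
df_pandas['order_value_EUR'] = pd.to_numeric(df_pandas['order_value_EUR'].astype(str).str.replace(',', ''))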

#### **Explanation: Distribution of Order Value**
- **Why:** ```To understand the distribution of the sales amounts.```
- **What it Represents:** ```It helps identify patterns, outliers, and the overall range of order values in the dataset.```

In [24]:
# Distribution of Order Value
fig_order_value = px.histogram(df_pandas, x="order_value_EUR", nbins=30, title="Distribution of Order Value (EUR)")
fig_order_value.update_layout(xaxis_title='Order Value (EUR)', yaxis_title='Count')
fig_order_value.show()

#### **Explanation: Total Sales by Country**
- **Why:** ```To see the total sales volume generated by each country.```
- **What it Represents:** ```This helps identify which countries are contributing the most to total sales revenue.```

In [25]:
# Total Sales by Country
df_sales_by_country = df_pandas.groupby('country')['order_value_EUR'].sum().reset_index()
fig_sales_country = px.bar(df_sales_by_country, x='country', y='order_value_EUR', 
                           title="Total Sales by Country", labels={'order_value_EUR': 'Total Sales (EUR)'})
fig_sales_country.show()

#### **Explanation: Cost vs Order Value (Scatter Plot)**
- **Why:** ```To explore the relationship between the cost and the order value.```
- **What it Represents:** ```This scatter plot shows how order values correspond to their costs, helping identify any strong linear relationships or potential outliers.```

In [26]:
# Cost vs Order Value (Scatter Plot)
fig_cost_order = px.scatter(df_pandas, x="order_value_EUR", y="cost", 
                            color="country", title="Cost vs Order Value by Country")
fig_cost_order.update_layout(xaxis_title='Order Value (EUR)', yaxis_title='Cost')
fig_cost_order.show()

#### **Explanation: Count of Categories**
- **Why:** ```To visualize the distribution of categories and see which product types are sold the most.```
- **What it Represents:** ```It shows how frequently each category appears in the dataset, helping us understand which categories account for the most orders.```

In [23]:
# Count of each Category
fig_category_count = px.bar(df_pandas, x='category', title="Count of Categories", 
                            labels={'category': 'Category'}, color='category')
fig_category_count.update_layout(xaxis_title='Category', yaxis_title='Count', showlegend=False)
fig_category_count.show()

#### **Explanation: Sales over Time**
- **Why:** ```To analyze how sales are trending over time.```
- **What it Represents:** ```This line chart shows how sales have fluctuated over different time periods, revealing any seasonal patterns or growth trends.```

In [28]:
# Sales over Time
df_pandas['date'] = pd.to_datetime(df_pandas['date'], format='%m/%d/%Y')
df_sales_over_time = df_pandas.groupby('date')['order_value_EUR'].sum().reset_index()
fig_sales_time = px.line(df_sales_over_time, x='date', y='order_value_EUR', title="Sales Over Time")
fig_sales_time.update_layout(xaxis_title='Date', yaxis_title='Total Sales (EUR)')
fig_sales_time.show()
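
Daily totals can be noisy, and summing by month is one way to make seasonal patterns easier to see. A small plain-Python sketch of that grouping, using four rows from the dataset preview (the monthly bucketing is an assumption for illustration; the notebook itself groups by day):

```python
from collections import defaultdict
from datetime import datetime

# (date, order value) pairs taken from the first rows of the dataset preview.
sales = [
    ("2/12/2020", 17524.02),
    ("9/26/2019", 116563.40),
    ("7/11/2019", 296465.56),
    ("4/2/2020", 74532.02),
]

monthly_totals = defaultdict(float)
for date_str, value in sales:
    # strptime accepts non-zero-padded months and days with %m and %d.
    parsed = datetime.strptime(date_str, "%m/%d/%Y")
    monthly_totals[parsed.strftime("%Y-%m")] += value

for month in sorted(monthly_totals):
    print(month, round(monthly_totals[month], 2))
```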

### **Data Preprocessing PART # 2**

##### **Handling Categorical Features**

In [31]:
columns_to_convert = ["country", "category", "device_type"]
for col_name in columns_to_convert:
    indexer = StringIndexer(inputCol=col_name, outputCol=f"{col_name}_indexed")
    df_pyspark = indexer.fit(df_pyspark).transform(df_pyspark)

df_pyspark.show()

+--------+---------------+---------+----------+-----------+--------------------+---------------+-------------------+-----------+----------+---------------+----------------+-------------------+
| country|order_value_EUR|     cost|      date|   category|       customer_name|  sales_manager|          sales_rep|device_type|  order_id|country_indexed|category_indexed|device_type_indexed|
+--------+---------------+---------+----------+-----------+--------------------+---------------+-------------------+-----------+----------+---------------+----------------+-------------------+
|  Sweden|      17,524.02| 14122.61| 2/12/2020|      Books|     Goldner-Dibbert|   Maxie Marrow|      Madelon Bront|     Mobile|70-0511466|            2.0|             4.0|                1.0|
| Finland|     116,563.40| 92807.78| 9/26/2019|      Games|    Hilll-Vandervort|     Hube Corey|        Wat Bowkley|     Mobile|28-6585323|            4.0|             1.0|                1.0|
|Portugal|     296,465.56|257480.34
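
By default `StringIndexer` orders labels by descending frequency (`stringOrderType="frequencyDesc"`), so the most common value gets index 0.0; in Spark 3.x ties are broken alphabetically. A plain-Python sketch of that ordering on made-up values (`index_by_frequency` is an illustrative helper, not a Spark API):

```python
from collections import Counter

def index_by_frequency(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Made-up sample, not the real dataset.
devices = ["PC", "Mobile", "PC", "Tablet", "PC", "Mobile"]
device_index = index_by_frequency(devices)
print(device_index)
```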

##### **Type casting**

In [32]:
df_pyspark = df_pyspark.withColumn("order_value_EUR", regexp_replace(col("order_value_EUR"), ",", ""))

In [33]:
df_pyspark = df_pyspark.withColumn("order_value_EUR", col("order_value_EUR").cast("double"))

In [34]:
df_pyspark.filter(col("order_value_EUR").isNull()).show()

+-------+---------------+----+----+--------+-------------+-------------+---------+-----------+--------+---------------+----------------+-------------------+
|country|order_value_EUR|cost|date|category|customer_name|sales_manager|sales_rep|device_type|order_id|country_indexed|category_indexed|device_type_indexed|
+-------+---------------+----+----+--------+-------------+-------------+---------+-----------+--------+---------------+----------------+-------------------+
+-------+---------------+----+----+--------+-------------+-------------+---------+-----------+--------+---------------+----------------+-------------------+
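
The cleanup above works in two steps: strip the thousands separators, then cast. In Spark's default (non-ANSI) mode a string that cannot be parsed as a number becomes NULL on cast, which is why the empty result of the NULL filter confirms that every value converted. The same two steps in plain Python, as a rough sketch (`parse_order_value` is an illustrative helper):

```python
def parse_order_value(text):
    """Remove thousands separators and convert to float."""
    return float(text.replace(",", ""))

# Values copied from the dataset preview.
raw_values = ["17,524.02", "116,563.40", "296,465.56"]
parsed_values = [parse_order_value(v) for v in raw_values]
print(parsed_values)

# Without the cleanup the conversion fails outright in Python.
try:
    float("17,524.02")
    direct_cast_ok = True
except ValueError:
    direct_cast_ok = False
print(direct_cast_ok)
```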



In [35]:
df_pyspark.printSchema()

root
 |-- country: string (nullable = true)
 |-- order_value_EUR: double (nullable = true)
 |-- cost: double (nullable = true)
 |-- date: string (nullable = true)
 |-- category: string (nullable = true)
 |-- customer_name: string (nullable = true)
 |-- sales_manager: string (nullable = true)
 |-- sales_rep: string (nullable = true)
 |-- device_type: string (nullable = true)
 |-- order_id: string (nullable = true)
 |-- country_indexed: double (nullable = false)
 |-- category_indexed: double (nullable = false)
 |-- device_type_indexed: double (nullable = false)



In [None]:
# df_pyspark = df_pyspark.withColumn("date", to_date(col("date"), "M/d/yyyy"))  # needs to_date from pyspark.sql.functions; dates such as 2/12/2020 are not zero-padded

In [None]:
# df_pyspark.show()

In [36]:
# Combine the independent features into a single vector column; the target (cost) stays a separate column
featureassembler = VectorAssembler(
    inputCols=["country_indexed", "order_value_EUR", "category_indexed", "device_type_indexed"],
    outputCol="Independent Features",
)
output = featureassembler.transform(df_pyspark)

In [37]:
output.show()

+--------+---------------+---------+----------+-----------+--------------------+---------------+-------------------+-----------+----------+---------------+----------------+-------------------+--------------------+
| country|order_value_EUR|     cost|      date|   category|       customer_name|  sales_manager|          sales_rep|device_type|  order_id|country_indexed|category_indexed|device_type_indexed|Independent Features|
+--------+---------------+---------+----------+-----------+--------------------+---------------+-------------------+-----------+----------+---------------+----------------+-------------------+--------------------+
|  Sweden|       17524.02| 14122.61| 2/12/2020|      Books|     Goldner-Dibbert|   Maxie Marrow|      Madelon Bront|     Mobile|70-0511466|            2.0|             4.0|                1.0|[2.0,17524.02,4.0...|
| Finland|       116563.4| 92807.78| 9/26/2019|      Games|    Hilll-Vandervort|     Hube Corey|        Wat Bowkley|     Mobile|28-6585323|     

In [38]:
# Selecting only the feature vector and the target column (cost)
finalized_data = output.select("Independent Features", "cost")

In [39]:
finalized_data.show()

+--------------------+---------+
|Independent Features|     cost|
+--------------------+---------+
|[2.0,17524.02,4.0...| 14122.61|
|[4.0,116563.4,1.0...| 92807.78|
| (4,[1],[296465.56])|257480.34|
|[0.0,74532.02,5.0...| 59752.32|
|[8.0,178763.42,1....|146621.76|
|[8.0,84900.24,0.0...|  73701.9|
|[0.0,71620.08,4.0...| 62245.01|
|[3.0,156585.22,8....|126599.15|
|[0.0,78461.13,3.0...| 63537.82|
|[1.0,64827.8,3.0,...| 56043.63|
|[2.0,142664.34,3....|120808.16|
|[3.0,66673.19,6.0...| 52811.83|
|[0.0,136915.61,7....|114790.05|
|[3.0,164971.7,8.0...|132686.74|
|[3.0,149486.27,5....| 118662.2|
|[2.0,54078.92,4.0...| 46102.28|
|[1.0,107499.78,6....| 91364.06|
|[3.0,29493.79,1.0...| 24285.19|
|[0.0,147656.52,6....| 124193.9|
|[0.0,156839.31,6....|134709.28|
+--------------------+---------+
only showing top 20 rows
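
One row in the output above is printed as `(4,[1],[296465.56])` rather than as a bracketed list. That is Spark's sparse vector notation (size, non-zero indices, non-zero values); `VectorAssembler` emits it when it is more compact, here because all three indexed columns are 0.0 for that row. A plain-Python sketch of expanding it (`sparse_to_dense` is an illustrative helper):

```python
def sparse_to_dense(size, indices, values):
    """Expand (size, indices, values) into a plain list of floats."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# The sparse row from the output above.
dense_row = sparse_to_dense(4, [1], [296465.56])
print(dense_row)
```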



### **Machine Learning Model Implementation**

In [40]:
#train test split
train_data,test_data=finalized_data.randomSplit([0.75,0.25])
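
`randomSplit` assigns rows at random, so the 75/25 proportions are approximate and, without a `seed` argument, the split (and therefore the metrics reported for each model) changes between runs. A plain-Python sketch of a seeded per-row split (the `random_split` helper and the seed value are illustrative assumptions, not Spark's implementation):

```python
import random

def random_split(rows, train_fraction, seed):
    """Send each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))
train_a, test_a = random_split(rows, 0.75, seed=42)
train_b, test_b = random_split(rows, 0.75, seed=42)
print(len(train_a), len(test_a))  # close to 750 / 250, not exact
print(train_a == train_b)         # same seed gives the same split
```

In PySpark the corresponding option is the `seed` parameter of `randomSplit`, for example `finalized_data.randomSplit([0.75, 0.25], seed=42)`.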

```Use these metrics to get a comprehensive view of your model's performance.```
- RMSE and MSE are good for understanding the magnitude of the errors.
- MAE provides a more interpretable measure of average error.
- R² helps understand how well the model explains the variance in the target variable.

#### **Interpreting Regression Model Performance Using Evaluation Metrics**

Interpreting evaluation metrics such as RMSE, MAE, MSE, and R² requires context: the scale of the target variable (here, cost) and the specific use case. Each metric is interpreted as follows:

##### **1. Root Mean Squared Error (RMSE):**
- **Interpretation**: RMSE gives you a sense of how far, on average, your predictions are from the actual values. It’s in the same unit as the target variable (in this case, the cost).
- **Lower RMSE is better**: A lower RMSE indicates that your model's predictions are close to the actual values.
- **Thresholds for "Good" RMSE**: 
  - There isn’t a universal threshold for what is considered a good RMSE—it depends on the domain and the data. For example, if your target values (cost) range from 10,000 to 100,000 and your RMSE is 500, that would be quite good. But if your RMSE is 50,000, that would suggest large errors.

##### **2. Mean Absolute Error (MAE):**
- **Interpretation**: MAE represents the average absolute difference between the predicted and actual values, giving a more direct interpretation of the error in terms of actual units (cost).
- **Lower MAE is better**: Just like RMSE, a lower MAE means better performance.
- **Comparing MAE to RMSE**: 
  - If MAE is much lower than RMSE, it indicates that there are some significant outliers or large errors that RMSE is capturing more strongly (since RMSE squares the errors). If MAE and RMSE are close in value, the errors are more evenly spread.
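
The gap between MAE and RMSE can be seen on two tiny, made-up error lists with the same MAE: one with evenly spread errors and one with a single large error.

```python
import math

def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

even_errors = [1.0, 1.0, 1.0, 1.0]   # illustrative values
one_outlier = [0.0, 0.0, 0.0, 4.0]   # same total absolute error, one large miss

print(mae(even_errors), rmse(even_errors))  # 1.0 1.0
print(mae(one_outlier), rmse(one_outlier))  # 1.0 2.0
```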

##### **3. R-squared (R²):**
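- **Interpretation**: R² measures how much of the variance in the target variable your model explains. On the training data of a linear model with an intercept it lies between 0 and 1; on held-out data it can also be negative. Higher values indicate that the model explains more of the variance.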
- **Closer to 1 is better**: An R² of 1 means that your model explains all the variance in the target variable, while an R² of 0 means that it explains none.
- **Negative R²**: 
  - If R² is negative, it means that your model is performing worse than a simple mean model (which predicts the average value of the target variable for all observations).
- **Good R²**: As a rough rule of thumb (acceptable values depend on the domain):
  - 0.70 or higher: Strong model.
  - 0.50 to 0.70: Decent but could improve.
  - Below 0.50: Indicates the model may not be capturing enough of the variance in the data.
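
A minimal sketch of the usual definition, R² = 1 - SS_res / SS_tot, on made-up values. It shows the three reference points mentioned above: a perfect model, a model that always predicts the mean, and a model that is worse than the mean.

```python
def r_squared(actual, predicted):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0]                   # illustrative values
print(r_squared(actual, [1.0, 2.0, 3.0]))  # 1.0  (perfect)
print(r_squared(actual, [2.0, 2.0, 2.0]))  # 0.0  (always predicts the mean)
print(r_squared(actual, [3.0, 2.0, 1.0]))  # -3.0 (worse than the mean)
```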

##### **4. Mean Squared Error (MSE):**
- **Interpretation**: MSE is the mean of the squared errors, i.e. RMSE before taking the square root. It is harder to interpret than RMSE because it is expressed in squared units of the target (EUR² here) rather than in the units of the target itself.
- **Lower MSE is better**: Like RMSE, you want a lower MSE for better performance.
- **Sensitive to Outliers**: Since errors are squared, large errors (outliers) will disproportionately affect MSE, which might skew your interpretation if there are a few large outliers.

##### **5. Mean Absolute Percentage Error (MAPE):**
- **Interpretation**: MAPE expresses the error as a percentage of the actual values, making it easier to interpret in some business cases (e.g., forecasting).
- **Lower MAPE is better**: A lower MAPE indicates that your predictions are closer to the actual values.
- **Rule-of-thumb thresholds for MAPE** (domain-dependent):
  - Less than 10%: Excellent prediction.
  - 10-20%: Good prediction.
  - 20-50%: Reasonable prediction.
  - Greater than 50%: The model isn’t very accurate.
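
The evaluator metrics used in this notebook are `rmse`, `mae`, `mse`, and `r2`, so MAPE would have to be computed separately. A plain-Python sketch on made-up values (the `mape` helper is illustrative; it assumes no actual value is zero, since the ratio is undefined there):

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Actual values must be non-zero."""
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)

actual = [100.0, 200.0, 400.0]      # illustrative values
predicted = [110.0, 180.0, 400.0]   # off by 10%, 10%, and 0%
mape_value = mape(actual, predicted)
print(round(mape_value, 2))
```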


##### **Linear Regression**

In [65]:
from pyspark.ml.evaluation import RegressionEvaluator

regressor=LinearRegression(featuresCol='Independent Features', labelCol='cost')
regressor_model=regressor.fit(train_data)

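# Fitted parameters live on the model, not on the estimator
# regressor_model.coefficients
# regressor_model.intercept

# Alternative: evaluate directly on the test set
# linear_regression_results = regressor_model.evaluate(test_data)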

# Make predictions
linear_regression_predictions = regressor_model.transform(test_data)

# Initialize evaluators for different metrics
evaluator_rmse = RegressionEvaluator(labelCol="cost", predictionCol="prediction", metricName="rmse")
evaluator_mae = RegressionEvaluator(labelCol="cost", predictionCol="prediction", metricName="mae")
evaluator_mse = RegressionEvaluator(labelCol="cost", predictionCol="prediction", metricName="mse")
evaluator_r2 = RegressionEvaluator(labelCol="cost", predictionCol="prediction", metricName="r2")

# Evaluate Linear Regression model predictions using RMSE
linear_regression_rmse = evaluator_rmse.evaluate(linear_regression_predictions)
print("RMSE for Linear Regression:", linear_regression_rmse)

# Evaluate Linear Regression model predictions using MAE
linear_regression_mae = evaluator_mae.evaluate(linear_regression_predictions)
print("MAE for Linear Regression:", linear_regression_mae)

# Evaluate Linear Regression model predictions using MSE
linear_regression_mse = evaluator_mse.evaluate(linear_regression_predictions)
print("MSE for Linear Regression:", linear_regression_mse)

# Evaluate Linear Regression model predictions using R-squared
linear_regression_r2 = evaluator_r2.evaluate(linear_regression_predictions)
print("R-squared (R2) for Linear Regression:", linear_regression_r2)

RMSE for Linear Regression: 3305.5206077492367
MAE for Linear Regression: 2560.949118899484
MSE for Linear Regression: 10926466.488254882
R-squared (R2) for Linear Regression: 0.9957595798232123
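
As a quick sanity check, the reported MSE should equal the square of the reported RMSE. A short sketch using the two values printed above (the tolerance is an arbitrary choice):

```python
import math

# Values copied from the output above.
reported_rmse = 3305.5206077492367
reported_mse = 10926466.488254882

print(reported_rmse ** 2)
consistent = math.isclose(reported_rmse ** 2, reported_mse, rel_tol=1e-9)
print(consistent)
```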


In [60]:
# Compare actual cost with the predicted values
linear_regression_predictions.select("cost", "prediction").show()

+---------+------------------+
|     cost|        prediction|
+---------+------------------+
| 17148.18|16306.609917082884|
|  46713.8| 44459.70808984917|
| 47121.65| 45829.34280866668|
| 60442.51|60565.867119081864|
| 73454.25| 77061.64361490075|
| 84571.17| 81903.21617164585|
| 88066.99| 84667.61041955814|
| 89420.05| 92736.39086210588|
| 98171.79|102607.12402797394|
|110963.17|  106105.192666266|
|112140.88|114361.40319580601|
|123030.97|123215.62079140266|
|138872.91|132611.90076465235|
|193587.72|197936.37422967347|
|257480.34| 246472.9599450399|
| 23656.32|24277.634186529314|
| 26800.66|27328.076396795106|
| 31559.76|31528.945669458953|
| 31967.36|31541.225442797964|
| 33113.15|32008.645560094516|
+---------+------------------+
only showing top 20 rows



In [54]:
### Performance Metrics
# pred_results.r2,pred_results.meanAbsoluteError,pred_results.meanSquaredError

##### **Decision Tree**

In [64]:
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.evaluation import RegressionEvaluator

# Train Decision Tree Regressor
dt_regressor = DecisionTreeRegressor(featuresCol='Independent Features', labelCol='cost')
dt_model = dt_regressor.fit(train_data)

# Predictions
dt_predictions = dt_model.transform(test_data)

# Initialize evaluators for different metrics
evaluator_rmse = RegressionEvaluator(labelCol="cost", predictionCol="prediction", metricName="rmse")
evaluator_mae = RegressionEvaluator(labelCol="cost", predictionCol="prediction", metricName="mae")
evaluator_mse = RegressionEvaluator(labelCol="cost", predictionCol="prediction", metricName="mse")
evaluator_r2 = RegressionEvaluator(labelCol="cost", predictionCol="prediction", metricName="r2")

# Evaluate dt model predictions using RMSE
dt_rmse = evaluator_rmse.evaluate(dt_predictions)
print("RMSE for Decision Tree:", dt_rmse)

# Evaluate dt model predictions using MAE
dt_mae = evaluator_mae.evaluate(dt_predictions)
print("MAE for Decision Tree:", dt_mae)

# Evaluate dt model predictions using MSE
dt_mse = evaluator_mse.evaluate(dt_predictions)
print("MSE for Decision Tree:", dt_mse)

# Evaluate dt model predictions using R-squared
dt_r2 = evaluator_r2.evaluate(dt_predictions)
print("R-squared (R2) for Decision Tree:", dt_r2)

RMSE for Decision Tree: 7565.1164647800215
MAE for Decision Tree: 4365.365917604619
MSE for Decision Tree: 57230987.125685774
R-squared (R2) for Decision Tree: 0.9777893948783807


##### **Random Forest Regressor**

In [63]:
from pyspark.ml.regression import RandomForestRegressor

# Train Random Forest Regressor
rf_regressor = RandomForestRegressor(featuresCol='Independent Features', labelCol='cost')
rf_model = rf_regressor.fit(train_data)

# Predictions
rf_predictions = rf_model.transform(test_data)

# Initialize evaluators for different metrics
evaluator_rmse = RegressionEvaluator(labelCol="cost", predictionCol="prediction", metricName="rmse")
evaluator_mae = RegressionEvaluator(labelCol="cost", predictionCol="prediction", metricName="mae")
evaluator_mse = RegressionEvaluator(labelCol="cost", predictionCol="prediction", metricName="mse")
evaluator_r2 = RegressionEvaluator(labelCol="cost", predictionCol="prediction", metricName="r2")

# Evaluate rf model predictions using RMSE
rf_rmse = evaluator_rmse.evaluate(rf_predictions)
print("RMSE for Random Forest:", rf_rmse)

# Evaluate rf model predictions using MAE
rf_mae = evaluator_mae.evaluate(rf_predictions)
print("MAE for Random Forest:", rf_mae)

# Evaluate rf model predictions using MSE
rf_mse = evaluator_mse.evaluate(rf_predictions)
print("MSE for Random Forest:", rf_mse)

# Evaluate rf model predictions using R-squared
rf_r2 = evaluator_r2.evaluate(rf_predictions)
print("R-squared (R2) for Random Forest:", rf_r2)

RMSE for Random Forest: 8767.939555309484
MAE for Random Forest: 6570.686606827068
MSE for Random Forest: 76876764.04556066
R-squared (R2) for Random Forest: 0.9701651232138625


##### **Gradient Boosting Regressor**

In [51]:
from pyspark.ml.regression import GBTRegressor

# Train Gradient Boosting Regressor
gbt_regressor = GBTRegressor(featuresCol='Independent Features', labelCol='cost')
gbt_model = gbt_regressor.fit(train_data)

# Predictions
gbt_predictions = gbt_model.transform(test_data)

# Initialize evaluators for different metrics
evaluator_rmse = RegressionEvaluator(labelCol="cost", predictionCol="prediction", metricName="rmse")
evaluator_mae = RegressionEvaluator(labelCol="cost", predictionCol="prediction", metricName="mae")
evaluator_mse = RegressionEvaluator(labelCol="cost", predictionCol="prediction", metricName="mse")
evaluator_r2 = RegressionEvaluator(labelCol="cost", predictionCol="prediction", metricName="r2")

# Evaluate GBT model predictions using RMSE
gbt_rmse = evaluator_rmse.evaluate(gbt_predictions)
print("RMSE for Gradient Boosting:", gbt_rmse)

# Evaluate GBT model predictions using MAE
gbt_mae = evaluator_mae.evaluate(gbt_predictions)
print("MAE for Gradient Boosting:", gbt_mae)

# Evaluate GBT model predictions using MSE
gbt_mse = evaluator_mse.evaluate(gbt_predictions)
print("MSE for Gradient Boosting:", gbt_mse)

# Evaluate GBT model predictions using R-squared
gbt_r2 = evaluator_r2.evaluate(gbt_predictions)
print("R-squared (R2) for Gradient Boosting:", gbt_r2)

RMSE for Gradient Boosting: 7915.809153661712
MAE for Gradient Boosting: 4555.010760927906
MSE for Gradient Boosting: 62660034.557194546
R-squared (R2) for Gradient Boosting: 0.975682451861253


### **Conclusion**

 <table>
        <thead>
            <tr>
                <th>Model</th>
                <th>RMSE</th>
                <th>MAE</th>
                <th>MSE</th>
                <th>R-squared (R²)</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>Linear Regression</td>
                <td>3305.52</td>
                <td>2560.95</td>
                <td>10,926,466.49</td>
                <td>0.9958</td>
            </tr>
            <tr>
                <td>Decision Tree</td>
                <td>7565.12</td>
                <td>4365.37</td>
                <td>57,230,987.13</td>
                <td>0.9778</td>
            </tr>
            <tr>
                <td>Random Forest</td>
                <td>8767.94</td>
                <td>6570.69</td>
                <td>76,876,764.05</td>
                <td>0.9702</td>
            </tr>
            <tr>
                <td>Gradient Boosting</td>
                <td>7915.81</td>
                <td>4555.01</td>
                <td>62,660,034.56</td>
                <td>0.9757</td>
            </tr>
        </tbody>
    </table>
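
The per-metric comparison that follows can also be done programmatically. A small sketch that picks the best model for each metric from the rounded values in the table (lower is better for RMSE, MAE, and MSE; higher is better for R²):

```python
# Rounded values copied from the table above.
results = {
    "Linear Regression": {"rmse": 3305.52, "mae": 2560.95, "mse": 10926466.49, "r2": 0.9958},
    "Decision Tree":     {"rmse": 7565.12, "mae": 4365.37, "mse": 57230987.13, "r2": 0.9778},
    "Random Forest":     {"rmse": 8767.94, "mae": 6570.69, "mse": 76876764.05, "r2": 0.9702},
    "Gradient Boosting": {"rmse": 7915.81, "mae": 4555.01, "mse": 62660034.56, "r2": 0.9757},
}

# Error metrics: smallest value wins.
best = {m: min(results, key=lambda name: results[name][m]) for m in ("rmse", "mae", "mse")}
# R-squared: largest value wins.
best["r2"] = max(results, key=lambda name: results[name]["r2"])

for metric, model in best.items():
    print(metric, "->", model)
```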

##### **Root Mean Squared Error (RMSE):**
- **Best Model:** ```Linear Regression (3305.52)```
- **Interpretation:** ```Linear Regression has the lowest RMSE, indicating its predictions are, on average, closer to the actual values compared to the other models.```

##### **Mean Absolute Error (MAE):**
- **Best Model:** ```Linear Regression (2560.95)```
- **Interpretation:** ```Linear Regression also has the lowest MAE, suggesting it has the smallest average error in absolute terms.```

##### **Mean Squared Error (MSE):**
- **Best Model:** ```Linear Regression (10,926,466.49)```
- **Interpretation:** ```Linear Regression has the lowest MSE, meaning it has the smallest squared errors overall, which is a good sign of model accuracy.```

##### **R-squared (R²):**
- **Best Model:** ```Linear Regression (0.9958)```
- **Interpretation:** ```Linear Regression has the highest R² value, indicating that it explains the most variance in the target variable compared to the other models.```

```Linear Regression performed best on every metric (RMSE, MAE, MSE, and R²) on this test split, so it gives the most accurate predictions of the four models for this dataset. A likely reason is that cost is close to a fixed share of order value (roughly 80 to 87% in the rows shown above), a relationship a linear model captures directly.```

```Decision Tree, Random Forest, and Gradient Boosting are also strong models but have higher RMSE, MAE, and MSE values compared to Linear Regression, and slightly lower R² values.```