# MAIN TASK

Our main task is to predict the **tip_amount** column of the dataset. We
find this interesting because an accurate tip prediction could help both drivers and
passengers estimate the total cost of a trip.

This is a regression task, so the algorithms we expect to implement
are linear regression, decision tree regression, random forest regression and gradient-boosted
tree regression. We will measure the performance of each model and
evaluate the results with metrics such as Root Mean Squared Error, Mean Absolute Error
and R-Squared.
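
As a reminder of what these metrics measure, here is a minimal pure-Python sketch that uses the standard textbook definitions. The function name `regression_metrics` and the sample values are hypothetical and are not part of the notebook's pipeline, which relies on Spark's `RegressionEvaluator` instead.

```python
import math

def regression_metrics(actual, predicted):
    """Return MAE, RMSE and R-squared for two equal-length sequences."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_actual = sum(actual) / n
    # Total sum of squares: spread of the actual values around their mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - (mse * n) / ss_tot
    return {"mae": mae, "rmse": math.sqrt(mse), "r2": r2}

# Hypothetical tips (in dollars) and model predictions, for illustration only.
tips = [0.0, 2.0, 4.0, 6.0]
preds = [1.0, 2.0, 3.0, 6.0]
metrics = regression_metrics(tips, preds)
print(metrics)
```

MAE and RMSE are expressed in the same unit as the target (dollars here), while R-Squared is unitless and compares the model against always predicting the mean.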

In [17]:
import pyspark
from pyspark.sql.functions import min, max

spark = (pyspark.sql.SparkSession.builder
         .appName("tip prediction")
         .config("spark.executor.memory", "6g")
         .config("spark.driver.memory", "6g")
         .getOrCreate())

data = spark.read.parquet("yellow_tripdata_2024-12.parquet")
data.show()

+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|Airport_fee|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|       2| 2024-12-01 00:12:27|  2024-12-01 00:31:12|              1|         9.76|         1|                 N|         138|          33|           1|       38.0|  6.0|    0.5|      4.7

In [18]:
value_counts = data.groupBy("passenger_count").count()
value_counts.show()

+---------------+-------+
|passenger_count|  count|
+---------------+-------+
|              0|  30397|
|              7|      7|
|              6|  14205|
|              9|      2|
|              5|  22283|
|              1|2531706|
|              3| 133727|
|              8|      9|
|              2| 507628|
|              4| 102116|
|           NULL| 326291|
+---------------+-------+



Looking at this, we observe a few things:
1. There are many null values, which we remove because they are probably incorrect entries in the dataset.
2. There are over 30k rows with a value of 0, which suggests a data entry error.
3. There are very few samples with the values 6, 7, 8 and 9.

Conclusions:

The values 6, 7, 8 and 9 will be merged into a single group, since a standard car has 5 passenger seats. We think this grouping could capture a relationship with customers who ask for bigger cars.

# DATA CLEANING + FEATURE ENGINEERING

We have found that the data needs some preprocessing, as some columns do not have a suitable structure for the models we are attempting to implement.

It is necessary to understand the dataset first, since we need to find out which columns might be relevant for building our predictions. After checking the data dictionary of the dataset (https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf), we found these columns relevant for predicting tip_amount:

- passenger_count: Number of passengers in the vehicle.
- trip_distance: Trip distance in miles.
- PULocationID: Pickup point of the trip. It has numerical values, each one representing a different TLC taxi zone.
- DOLocationID: Drop-off point of the trip. It follows the same range as PULocationID.
- fare_amount: Fare of the trip. In the code below it is only used to filter out invalid rows.

### EXPLANATION:
- We have decided to derive a trip_duration column from tpep_pickup_datetime and tpep_dropoff_datetime by subtracting the pickup time from the drop-off time and converting the result to minutes. The models could perform better with it.
- We have also decided to create a pickup_hour column taken from the tpep_pickup_datetime column. We find it one of the most relevant features, as taxis have different fees depending on the hour.
- A new column day_of_week is added, extracting its values from tpep_pickup_datetime.
- In the sample printed at the beginning of the notebook we noticed one row with negative values in columns that must be >= 0, so we apply a filter for those cases.
- For the passenger count column we are regrouping the values 6, 7, 8 and 9: the value 6 now represents 6 or more passengers. This is due to the lack of samples with the values 7, 8 or 9, which could confuse our models.
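
The derived columns above can be sketched for a single record in plain Python. This is only an illustration of the arithmetic: the function name `engineer_features` is hypothetical, and the notebook itself computes these values with Spark column functions. The input is the first row shown in the sample output at the top of the notebook.

```python
from datetime import datetime

def engineer_features(pickup, dropoff, passenger_count):
    """Derive the engineered features from one trip record."""
    fmt = "%Y-%m-%d %H:%M:%S"
    t_pickup = datetime.strptime(pickup, fmt)
    t_dropoff = datetime.strptime(dropoff, fmt)
    return {
        # Duration in minutes.
        "trip_duration": (t_dropoff - t_pickup).total_seconds() / 60,
        "pickup_hour": t_pickup.hour,
        # isoweekday() is 1 (Monday) .. 7 (Sunday); the modulo maps it to
        # 0 (Sunday) .. 6 (Saturday), the same range as Spark's dayofweek - 1.
        "day_of_week": t_pickup.isoweekday() % 7,
        # Values of 6 and above collapse into a single "6 or more" group.
        "passenger_count_grouped": min(passenger_count, 6),
    }

row = engineer_features("2024-12-01 00:12:27", "2024-12-01 00:31:12", 1)
print(row)
```

For this row the sketch gives a duration of 18.75 minutes, pickup hour 0 and day of week 0 (a Sunday), which agrees with the first row of the `selectedData.show()` output further down.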



In [19]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.sql.functions import unix_timestamp, hour, col, dayofweek, when

data = data.dropna(subset=["PULocationID", "DOLocationID","passenger_count", "trip_distance", "fare_amount", "tip_amount", "tpep_pickup_datetime", "tpep_dropoff_datetime"])


dataClean = data.filter((col("trip_distance") > 0) & (col("fare_amount") > 0) & (col("tip_amount") >= 0))
dataFeat = dataClean.withColumn("trip_duration", (unix_timestamp("tpep_dropoff_datetime") - unix_timestamp("tpep_pickup_datetime"))/60)

dataFeat = dataFeat.withColumn("passenger_count_grouped", when(col("passenger_count") >= 6, 6).otherwise(col("passenger_count")))

# dayofweek returns 1 (Sunday) to 7 (Saturday); we subtract 1 so the range starts at 0.
dataFeat = dataFeat.withColumn("day_of_week", dayofweek("tpep_pickup_datetime")-1)
dataFeat = dataFeat.withColumn("pickup_hour",hour("tpep_pickup_datetime"))
dataFeat = dataFeat.dropna(subset=["trip_duration", "pickup_hour", "day_of_week"])

columns = ["passenger_count_grouped","trip_distance","PULocationID","DOLocationID","trip_duration", "pickup_hour", "tip_amount", "day_of_week"]
selectedData = dataFeat.select(columns)
selectedData.show()

+-----------------------+-------------+------------+------------+-------------------+-----------+----------+-----------+
|passenger_count_grouped|trip_distance|PULocationID|DOLocationID|      trip_duration|pickup_hour|tip_amount|day_of_week|
+-----------------------+-------------+------------+------------+-------------------+-----------+----------+-----------+
|                      1|         9.76|         138|          33|              18.75|          0|      4.72|          0|
|                      1|         7.62|         158|          42|  32.18333333333333|         23|      8.46|          6|
|                      4|        20.07|         132|         236|  34.18333333333333|          0|       0.0|          0|
|                      3|         2.34|         142|         186|               15.0|          0|      4.12|          0|
|                      1|         5.05|         107|          80|               22.2|          0|       5.0|          0|
|                      1|       

In [20]:
value_counts = selectedData.groupBy("passenger_count_grouped").count()
value_counts.show()

+-----------------------+-------+
|passenger_count_grouped|  count|
+-----------------------+-------+
|                      0|  29150|
|                      6|  14091|
|                      5|  22057|
|                      1|2450879|
|                      3| 128633|
|                      2| 489550|
|                      4|  95171|
+-----------------------+-------+



## CATEGORICAL COLUMNS (INDEXERS AND ENCODERS)

To handle categorical columns we use StringIndexer followed by OneHotEncoder. Each value in PULocationID and DOLocationID represents a zone in NYC, and treating these IDs as numbers would cause problems: zone 100 is not twice as good as zone 50, they are just different places.

StringIndexer is designed to convert a column of labels into a column of label indices, and OneHotEncoder then turns each index into a binary vector.

The same methodology is applied to the columns day_of_week, pickup_hour and passenger_count_grouped.
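
To make the indexing and encoding idea concrete, here is a small pure-Python sketch. It mimics two behaviours of the Spark transformers: ordering labels by descending frequency (StringIndexer's default) and optionally dropping the last category (the `dropLast` option of OneHotEncoder). The helper names `fit_string_index` and `one_hot` and the zone values are hypothetical. The sketch returns plain lists rather than Spark's sparse vectors, and it ignores the extra bucket that `handleInvalid="keep"` adds.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], str(label)))
    return {label: idx for idx, label in enumerate(ordered)}

def one_hot(index, size, drop_last=False):
    """Return a dense 0/1 list; with drop_last the last category is all zeros."""
    length = size - 1 if drop_last else size
    return [1 if i == index else 0 for i in range(length)]

zones = [138, 132, 138, 142, 138, 132]  # hypothetical PULocationID values
index_of = fit_string_index(zones)
print(index_of)
print(one_hot(index_of[132], len(index_of)))
print(one_hot(index_of[142], len(index_of), drop_last=True))
```

After encoding, no zone is numerically "larger" than another: each one simply switches on its own position in the vector.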

In [21]:
indexer_pickup = StringIndexer(inputCol="PULocationID", outputCol="PULocationID_index", handleInvalid="keep")
indexer_dropoff = StringIndexer(inputCol="DOLocationID", outputCol="DOLocationID_index", handleInvalid="keep")

encoder_pickup = OneHotEncoder(inputCols=["PULocationID_index"], outputCols=["PULocationID_vec"], dropLast=False, handleInvalid="keep")
encoder_drop = OneHotEncoder(inputCols=["DOLocationID_index"], outputCols=["DOLocationID_vec"], dropLast=False, handleInvalid="keep")

indexer_day = StringIndexer(inputCol="day_of_week", outputCol="day_of_week_index", handleInvalid="keep")
indexer_hour = StringIndexer(inputCol="pickup_hour", outputCol="pickup_hour_index", handleInvalid="keep")
indexer_pass = StringIndexer(inputCol="passenger_count_grouped", outputCol="passenger_count_index", handleInvalid="keep")

encoder_day = OneHotEncoder(inputCols=["day_of_week_index"], outputCols=["day_of_week_vec"], dropLast=True)
encoder_hour = OneHotEncoder(inputCols=["pickup_hour_index"], outputCols=["pickup_hour_vec"], dropLast=True)
encoder_pass = OneHotEncoder(inputCols=["passenger_count_index"], outputCols=["passenger_count_vec"], dropLast=True)

# MODEL BUILDING

## MODEL TRAINING (LINEAR REGRESSION)

The steps we are following in order to train the different models are:

1. VectorAssembler: We combine ride details such as trip_distance and trip_duration with the vectorized locations (PULocationID_vec, DOLocationID_vec), time (day_of_week_vec, pickup_hour_vec) and passenger count (passenger_count_vec) into one 'features' input for our model.

2. LinearRegression (lr): We set up a Linear Regression model that learns to predict tip_amount from that combined features vector.

3. ParamGridBuilder (paramGrid): We define a set of tuning parameters (such as regularization) for the Linear Regression, so we can test which combination works best.

4. RegressionEvaluator (evaluator): This measures how good the model is by calculating the Mean Absolute Error (MAE) between its predictions and the actual tip_amount.

5. Pipeline: We chain all the data preparation steps (indexing, encoding, assembling) and the Linear Regression model into a single workflow.

6. CrossValidator (cv): This tool automatically trains and validates our pipeline with every parameter set from paramGrid using 3-fold cross-validation, picking the best one based on the MAE.

7. Data Sampling & Splitting: We sample 20% of the data (in a first attempt the model took a very long time to train on the full dataset) and then split it into dataTrain (for training/cross-validation) and dataTest (for final, unseen evaluation) to speed things up and properly assess performance.

8. Final Evaluation: After cv.fit(dataTrain) finds the best model (cvModel), we use it to make predictions on dataTest. Then we calculate MAE, RMSE, MSE and R2 to see how well this final model performs on completely new data.
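
To get a feeling for why training is slow, the following sketch counts how many pipeline fits the cross-validation step implies. The candidate values are the ones used in the linear-regression grid of this notebook. The final "+1" assumes that the cross-validator refits the best combination once on the whole training set after the search, which is our understanding of how Spark's CrossValidator behaves.

```python
from itertools import product

# Candidate values copied from the linear-regression grid in this notebook.
grid = {
    "regParam": [0.01, 0.1, 0.5],
    "elasticNetParam": [0.0, 0.5, 1.0],
    "maxIter": [50, 100],
}
num_folds = 3

# Every combination of one value per parameter.
param_maps = [dict(zip(grid, combo)) for combo in product(*grid.values())]
cv_fits = len(param_maps) * num_folds
total_fits = cv_fits + 1  # plus one final refit of the best combination

print(len(param_maps), cv_fits, total_fits)
```

With 18 combinations and 3 folds, the pipeline is fitted several dozen times, which is a strong argument for working on a sample of the data.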

We use a VectorAssembler to combine relevant features into a single vector for the model.

In [22]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression, RandomForestRegressor
from pyspark.ml.pipeline import Pipeline

assemblerCols = ["trip_distance","PULocationID_vec","DOLocationID_vec","trip_duration", "day_of_week_vec", "pickup_hour_vec", "passenger_count_vec"
]

assembler = VectorAssembler(inputCols=assemblerCols, outputCol='features')

We set up the Linear Regression model and create a pipeline that includes all preprocessing steps and the model itself.

In [23]:
lr = LinearRegression(labelCol="tip_amount", featuresCol="features")
paramGrid = (ParamGridBuilder().addGrid(lr.regParam,[0.01,0.1,0.5])
             .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
             .addGrid(lr.maxIter, [50, 100])
             .build())

evaluator = RegressionEvaluator(labelCol='tip_amount', predictionCol='prediction', metricName='mae')
pipeline = Pipeline(stages=[
    indexer_pickup, indexer_dropoff,
    indexer_day, indexer_hour, indexer_pass,
    encoder_pickup, encoder_drop,
    encoder_day, encoder_hour, encoder_pass,
    assembler, lr
])


We use cross-validation to tune hyperparameters and split the data for training and testing.
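
The relationship between the train/test split and the cross-validation folds can be sketched in plain Python. This is a simplified illustration, not what Spark does internally: the helper names are hypothetical, the cuts here are exact, and Spark's `sample` and `randomSplit` assign rows randomly so their sizes are only approximate.

```python
import random

def train_test_split(rows, train_fraction, seed):
    """Shuffle a copy of rows and cut it into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def k_folds(rows, k):
    """Yield (training, validation) pairs; each row is validated exactly once."""
    for i in range(k):
        validation = rows[i::k]
        training = [r for j, r in enumerate(rows) if j % k != i]
        yield training, validation

rows = list(range(100))  # stand-in for 100 trip records
train, test = train_test_split(rows, 0.8, seed=42)
folds = list(k_folds(train, 3))
print(len(train), len(test), [len(v) for _, v in folds])
```

The test rows never take part in any fold, so the final metrics are computed on data that neither training nor hyperparameter selection has seen.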


In [24]:
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=3, parallelism=2)

sampledData = selectedData.sample(withReplacement=False, fraction=0.2, seed=42)
dataTrain, dataTest = sampledData.randomSplit([0.8,0.2], seed=42)
dataTrain.show()

+-----------------------+-------------+------------+------------+------------------+-----------+----------+-----------+
|passenger_count_grouped|trip_distance|PULocationID|DOLocationID|     trip_duration|pickup_hour|tip_amount|day_of_week|
+-----------------------+-------------+------------+------------+------------------+-----------+----------+-----------+
|                      0|          0.1|          48|          48|0.6666666666666666|          6|       0.0|          4|
|                      0|          0.1|         186|         164|              10.2|         18|       2.8|          6|
|                      0|          0.1|         262|         262|              0.65|         18|       0.0|          5|
|                      0|          0.1|         265|         265|0.5333333333333333|          0|       0.0|          5|
|                      0|          0.2|         100|         164|               5.0|         10|       2.1|          6|
|                      0|          0.2| 

Fit the cross-validated pipeline on the training data, make predictions, and compute evaluation metrics.

In [25]:
cvModel = cv.fit(dataTrain)

predictions = cvModel.transform(dataTest)
mae = evaluator.evaluate(predictions)
print("MAE:", mae)

rmse_eval = RegressionEvaluator(labelCol='tip_amount', predictionCol='prediction', metricName='rmse')
rmse = rmse_eval.evaluate(predictions)
print("RMSE:",rmse)

mse_eval = RegressionEvaluator(labelCol='tip_amount', predictionCol='prediction', metricName='mse')
mse = mse_eval.evaluate(predictions)
print("MSE:",mse)

r2_evaluator = RegressionEvaluator(labelCol="tip_amount", predictionCol="prediction", metricName="r2")
r2 = r2_evaluator.evaluate(predictions)
print("R2:", r2)

                                                                                

MAE: 2.149131881325228
RMSE: 3.359277072196714
MSE: 11.284742447786527
R2: 0.34394445866483747


### RESULTS:

Linear Regression provides a simple baseline model. With an R² score of 0.34, it explains approximately 34% of the variance in the target variable. This indicates that it may struggle to model complex patterns in the data. While it's not highly accurate, its interpretability and speed make it a reasonable starting point for regression problems. 

### POSSIBLE IMPROVEMENTS:

We tested whether adding more columns (such as pickup_hour, day_of_week,
is_night and is_weekend) would improve or degrade model performance. The results were worse, so we kept the feature set as it is now. The metrics obtained in that experiment were:

MAE on Test data: 2.1867288596114304
RMSE on Test Data: 3.5895547766976675
MSE on Test Data: 12.365532960156768

## DECISION TREE REGRESSOR

Decision Trees can capture non-linear relationships and interactions between features. Here, we use a Decision Tree Regressor to predict tip amounts.
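
To illustrate how a tree can follow a non-linear pattern, here is a minimal sketch of a single regression split on one feature: it tries every midpoint between consecutive values and keeps the threshold with the lowest squared error. The function name `best_split` and the data are hypothetical. Spark's implementation is more elaborate (for example, it bins candidate thresholds according to maxBins), so this only shows the principle.

```python
def best_split(xs, ys):
    """Find the threshold on one feature that minimises squared error."""
    def sse(values):
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = None
    points = sorted(set(xs))
    # Candidate thresholds are the midpoints between consecutive feature values.
    for low, high in zip(points, points[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, sum(left) / len(left), sum(right) / len(right))
    return best

# Hypothetical data: tips jump once the trip exceeds a certain distance.
distance = [1, 2, 3, 10, 11, 12]
tip = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
cost, threshold, left_mean, right_mean = best_split(distance, tip)
print(threshold, left_mean, right_mean)
```

A straight line cannot represent this step exactly, whereas one split predicts each side with its own mean. A full tree repeats the same search recursively inside each side.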


Set up the Decision Tree model and create a pipeline with relevant preprocessing steps.


In [26]:
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline
dt = DecisionTreeRegressor(featuresCol="features", labelCol="tip_amount", seed=42)

paramGridDT = (ParamGridBuilder()
               .addGrid(dt.maxDepth, [5, 10, 15]).addGrid(dt.maxBins, [32,64])
               .build())

assemblerDecTree = VectorAssembler(
    inputCols=[
        "trip_distance", "trip_duration", "pickup_hour",
        "day_of_week_vec",
        "PULocationID_vec", "DOLocationID_vec"
    ],
    outputCol="features"
)
pipeline = Pipeline(stages=[
    indexer_pickup, indexer_dropoff,
    indexer_day,
    encoder_pickup, encoder_drop,
    encoder_day,
    assemblerDecTree,
    dt
])
evaluator = RegressionEvaluator(labelCol="tip_amount", predictionCol="prediction", metricName="mae")

cvDT = CrossValidator(estimator=pipeline,
                      estimatorParamMaps=paramGridDT,
                      evaluator=evaluator,
                      numFolds=3)

Fit the model, make predictions, and compute evaluation metrics.


In [27]:
modelDT = cvDT.fit(dataTrain)
predictions = modelDT.transform(dataTest)

mae = evaluator.evaluate(predictions)
rmse_evaluator = RegressionEvaluator(labelCol="tip_amount", predictionCol="prediction", metricName="rmse")
r2_evaluator = RegressionEvaluator(labelCol="tip_amount", predictionCol="prediction", metricName="r2")
mse_evaluator = RegressionEvaluator(labelCol="tip_amount", predictionCol="prediction", metricName="mse")

print("MAE:", mae)
print("RMSE:", rmse_evaluator.evaluate(predictions))
print("MSE:", mse_evaluator.evaluate(predictions))
print("R2:", r2_evaluator.evaluate(predictions))


MAE: 2.0265963049153024
RMSE: 3.220148866862463
MSE: 10.369358724755607
R2: 0.3971616735654967


### RESULTS

The Decision Tree model improves on Linear Regression on all metrics, with a higher R² score (0.40 versus 0.34). This suggests that it captures non-linear patterns more effectively, though it may be prone to overfitting without pruning or depth control.

## RANDOM FOREST REGRESSOR


Random Forests are ensembles of decision trees that improve predictive performance and reduce overfitting. Here, we use a Random Forest Regressor for tip prediction.

In [28]:
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline

rf = RandomForestRegressor(featuresCol="features", labelCol="tip_amount")

paramGridRF = (ParamGridBuilder()
               .addGrid(rf.numTrees, [10, 20])
               .addGrid(rf.maxDepth, [5, 10])
               .build())

pipeline = Pipeline(stages=[
    indexer_pickup, indexer_dropoff,
    indexer_day, indexer_hour, indexer_pass,
    encoder_pickup, encoder_drop,
    encoder_day, encoder_hour, encoder_pass,
    assembler, rf
])
evaluator = RegressionEvaluator(labelCol="tip_amount", predictionCol="prediction", metricName="mae")

cvRF = CrossValidator(estimator=pipeline,
                      estimatorParamMaps=paramGridRF,
                      evaluator=evaluator,
                      numFolds=3,
                      parallelism=2)


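For this model, we use the same 20% sample in order to reduce training time. The cell below recreates it with the same seed; the model itself is fitted and evaluated on the dataTrain and dataTest split created earlier.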

In [29]:
sampledData = selectedData.sample(withReplacement=False, fraction=0.2, seed=42)

Fit the model, make predictions, and compute evaluation metrics.


In [30]:
cvModelRF = cvRF.fit(dataTrain)
predRF = cvModelRF.transform(dataTest)

evaluator = RegressionEvaluator(labelCol="tip_amount", predictionCol="prediction", metricName="mae")
print("MAE:", evaluator.evaluate(predRF))

rmse_eval = RegressionEvaluator(labelCol='tip_amount', predictionCol='prediction', metricName='rmse')
rmse = rmse_eval.evaluate(predRF)
print("RMSE:",rmse)

mse_eval = RegressionEvaluator(labelCol='tip_amount', predictionCol='prediction', metricName='mse')
mse = mse_eval.evaluate(predRF)
print("MSE:", mse)

r2_evaluator = RegressionEvaluator(labelCol="tip_amount", predictionCol="prediction", metricName="r2")
r2 = r2_evaluator.evaluate(predRF)
print("R2:", r2)

MAE: 2.0483346781582537
RMSE: 3.2245523545605392
MSE: 10.397737887301918
R2: 0.39551180810041375


### RESULTS

Random Forest performs nearly identically to the Decision Tree in terms of error and variance explained (MAE 2.05 versus 2.03, R² of about 0.40 for both). However, as an ensemble method, it typically generalizes better and reduces overfitting by averaging across many trees, which usually makes it a more stable option in practice.

## GRADIENT-BOOSTED TREE REGRESSOR

Gradient-Boosted Trees build an ensemble of trees sequentially, each correcting the errors of the previous one. This often leads to strong predictive performance.
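
The sequential error-correcting idea can be sketched with one-split trees ("stumps") in plain Python. This is a toy version under simplifying assumptions: squared-error loss, a single feature, and hypothetical data and function names. It is not Spark's GBTRegressor, but it follows the same principle of fitting each new tree to the residuals left by the current ensemble.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    points = sorted(set(xs))
    for low, high in zip(points, points[1:]):
        threshold = (low + high) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        cost = (sum((r - left_value) ** 2 for r in left)
                + sum((r - right_value) ** 2 for r in right))
        if best is None or cost < best[0]:
            best = (cost, threshold, left_value, right_value)
    return best[1:]

def boost(xs, ys, rounds, step_size):
    """Fit stumps to residuals one after another; return per-round training MSE."""
    # Start from the mean of the target as the initial prediction.
    predictions = [sum(ys) / len(ys)] * len(ys)
    history = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        # Add a shrunken copy of the new stump to the running prediction.
        predictions = [
            p + step_size * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return history

distance = [1, 2, 3, 4, 5, 6]          # hypothetical trip distances
tip = [0.5, 1.0, 1.5, 3.0, 3.5, 5.0]   # hypothetical tips
mse_per_round = boost(distance, tip, rounds=5, step_size=0.5)
print(mse_per_round)
```

The training error shrinks with every round. The `step_size` argument plays the role of the stepSize hyperparameter: smaller values correct more cautiously and usually need more rounds.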

Set up the GBT model and create a pipeline with all preprocessing steps.


In [31]:
from pyspark.ml.regression import GBTRegressor
# maxIter=30 is only a starting value; the grid below overrides it with 50 and 100.
gbt = GBTRegressor(labelCol="tip_amount", featuresCol="features", maxIter=30)

paramGrid = (ParamGridBuilder()
    .addGrid(gbt.maxDepth, [5, 7])
    .addGrid(gbt.maxIter, [50, 100])
    .addGrid(gbt.stepSize, [0.1, 0.2])
    .build())

evaluator = RegressionEvaluator(labelCol="tip_amount", predictionCol="prediction", metricName="mae")

pipeline = Pipeline(stages=[
    indexer_pickup, indexer_dropoff,
    indexer_day, indexer_hour, indexer_pass,
    encoder_pickup, encoder_drop,
    encoder_day, encoder_hour, encoder_pass,
    assembler, gbt
])

cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=3,
    parallelism=2
)


Fit the model, make predictions, and compute evaluation metrics.


In [32]:
cvModel = cv.fit(dataTrain)

predictions = cvModel.transform(dataTest)

mae = evaluator.evaluate(predictions)
print("MAE:", mae)

rmse_eval = RegressionEvaluator(labelCol="tip_amount", predictionCol="prediction", metricName="rmse")
rmse = rmse_eval.evaluate(predictions)
print("RMSE:", rmse)

mse_eval = RegressionEvaluator(labelCol="tip_amount", predictionCol="prediction", metricName="mse")
mse = mse_eval.evaluate(predictions)
print("MSE:", mse)

r2_eval = RegressionEvaluator(labelCol="tip_amount", predictionCol="prediction", metricName="r2")
r2 = r2_eval.evaluate(predictions)
print("R2:", r2)

MAE: 1.9779096838889811
RMSE: 3.1827300997375256
MSE: 10.129770887775239
R2: 0.41109047422840483


                                                                                

### RESULTS

GBT Regressor achieves the best performance across all metrics: lowest MAE, RMSE, and MSE, and the highest R² (0.41). This suggests it is not only more accurate in its predictions on average but also better at explaining the variance in the data. As a boosting method, it incrementally corrects errors from previous models, leading to improved accuracy and generalization when properly tuned.

## FINAL MODEL COMPARISON

| Model             | MAE  | RMSE | MSE   | R²   |
| ----------------- | ---- | ---- | ----- | ---- |
| Linear Regression | 2.15 | 3.36 | 11.28 | 0.34 |
| Decision Tree     | 2.03 | 3.22 | 10.37 | 0.40 |
| Random Forest     | 2.05 | 3.22 | 10.40 | 0.40 |
| GBT Regressor     | 1.98 | 3.18 | 10.13 | 0.41 |


The GBT Regressor stands out as the most effective model in this evaluation. It outperforms the others across all key metrics, indicating better average prediction accuracy and stronger ability to capture the underlying patterns in the data.

While Random Forest and Decision Tree offer good performance with similar results, GBT is the best choice when maximum predictive accuracy and variance explanation are desired. However, it may come with higher computational cost and tuning complexity.
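
As a sanity check on the reported numbers, the short sketch below applies two identities to the metrics printed by the cells above: RMSE is the square root of MSE, and, under the standard definition R² = 1 − MSE / Var(y), every model evaluated on the same test set should imply the same target variance. We assume here that Spark's evaluator follows this standard definition of R².

```python
# Test-set metrics (RMSE, MSE, R2) copied from the cell outputs of this notebook.
results = {
    "Linear Regression": (3.359277072196714, 11.284742447786527, 0.34394445866483747),
    "Decision Tree": (3.220148866862463, 10.369358724755607, 0.3971616735654967),
    "Random Forest": (3.2245523545605392, 10.397737887301918, 0.39551180810041375),
    "GBT Regressor": (3.1827300997375256, 10.129770887775239, 0.41109047422840483),
}

implied_variance = {}
for name, (rmse, mse, r2) in results.items():
    # RMSE is the square root of MSE, so this gap should be close to zero.
    gap = abs(rmse ** 2 - mse)
    # R2 = 1 - MSE / Var(y)  =>  Var(y) = MSE / (1 - R2)
    implied_variance[name] = mse / (1.0 - r2)
    print(f"{name}: gap={gap:.2e}, implied variance={implied_variance[name]:.4f}")
```

If the four implied variances agree, the models were indeed scored on the same dataTest rows, so their metrics are directly comparable.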
