Description
Course
data-engineering-zoomcamp
Question
When computing the longest trip_distance using Spark on the yellow_tripdata_2023-11 dataset, the result sometimes returns extremely large values (for example 90771.9).
Is this expected or is something wrong with the Spark query?
Answer
This is expected. The Spark query is correct; the large value comes from data quality issues in the NYC Taxi dataset.
Some rows contain unrealistic trip_distance values due to:
- GPS errors
- sensor faults
- corrupted trip records
- incorrect meter readings
When Spark calculates max(trip_distance) without filtering, these outliers are included in the result.
To obtain a realistic maximum distance, it is common to apply a simple filter before computing the maximum:
df.filter("trip_distance > 0 AND trip_distance < 200").selectExpr("max(trip_distance)").show()
This removes extreme outliers and produces values that better reflect real taxi trips.
Checklist
- I have searched existing FAQs and this question is not already answered
- The answer provides accurate, helpful information
- I have included any relevant code examples or links