[FAQ] Spark longest trip distance returns unrealistic values #230

@AsherJD-io

Course

data-engineering-zoomcamp

Question

When computing the longest trip_distance using Spark on the yellow_tripdata_2023-11 dataset, the result is sometimes an extremely large value (for example, 90771.9).

Is this expected or is something wrong with the Spark query?

Answer

This is expected. The query itself is fine; the value comes from data quality issues in the NYC Taxi dataset.

Some rows contain unrealistic trip_distance values. Likely causes include:

  • GPS errors
  • sensor faults
  • corrupted trip records
  • incorrect meter readings

When Spark calculates max(trip_distance) without filtering, these outliers are included in the result.

To obtain a realistic maximum distance, it is common to apply a simple filter before computing the maximum:

df.filter("trip_distance > 0 AND trip_distance < 200").selectExpr("max(trip_distance)").show()

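This removes zero-distance rows and extreme outliers, so the result better reflects real taxi trips. The 200-mile upper bound is a rule of thumb rather than an official limit; adjust it to suit your analysis.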
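To see why a single bad row dominates the result, here is a minimal plain-Python sketch of the same logic, with no Spark session needed. The sample values are made up for illustration; only 90771.9 comes from the question. The helper name max_trip_distance and its lower/upper parameters are hypothetical, chosen to mirror the bounds in the filter above.

```python
# Hypothetical sample of trip_distance values in miles.
# Only 90771.9 is taken from the question; the rest are illustrative.
trip_distances = [0.0, 1.2, 3.5, 17.8, 90771.9]


def max_trip_distance(distances, lower=0.0, upper=200.0):
    """Return the largest distance strictly between lower and upper.

    Returns None when no value passes the filter.
    """
    plausible = [d for d in distances if lower < d < upper]
    return max(plausible) if plausible else None


# Without filtering, the single outlier is the maximum.
unfiltered_max = max(trip_distances)

# With the same bounds as the Spark filter, the outlier is excluded.
filtered_max = max_trip_distance(trip_distances)

print(unfiltered_max)  # 90771.9
print(filtered_max)    # 17.8
```

The bounds are strict on both sides, matching trip_distance > 0 AND trip_distance < 200 in the Spark expression.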

Checklist

  • I have searched existing FAQs and this question is not already answered
  • The answer provides accurate, helpful information
  • I have included any relevant code examples or links
