[FAQ] Spark longest trip distance returns unrealistic values #230

### Course

data-engineering-zoomcamp

### Question

When computing the longest trip_distance using Spark on the yellow_tripdata_2023-11 dataset, the result sometimes returns extremely large values (for example 90771.9).

Is this expected or is something wrong with the Spark query?

### Answer

This is expected and is caused by data quality issues in the NYC Taxi dataset.

Some rows contain unrealistic trip_distance values due to:

- GPS errors
- sensor faults
- corrupted trip records
- incorrect meter readings

When Spark calculates max(trip_distance) without filtering, these outliers are included in the result.
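To see why a single bad row is enough to distort the result, here is a minimal plain-Python sketch that does not need Spark. The sample values are made up for illustration, except 90771.9, which is the value reported in the question.

```python
# Toy sample of trip distances. All values are invented for illustration,
# except 90771.9, which is the outlier reported in the question.
distances = [1.2, 3.4, 17.8, 0.0, 9.5, 90771.9]

# max() looks at every value, so one bad record decides the answer.
unfiltered_max = max(distances)

# Sorting in descending order shows how far the outlier sits from the rest.
top_three = sorted(distances, reverse=True)[:3]

print(unfiltered_max)
print(top_three)
```

Looking at the few largest values, rather than only the maximum, is a quick way to check whether the top result is an isolated outlier.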

To obtain a realistic maximum distance, it is common to apply a simple filter before computing the maximum:

```python
# df is assumed to be the DataFrame already loaded from yellow_tripdata_2023-11
df.filter("trip_distance > 0 AND trip_distance < 200").selectExpr("max(trip_distance)").show()
```
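This drops zero-distance rows and extreme outliers, so the result better reflects real taxi trips. The upper bound of 200 is a heuristic rather than an official limit, so adjust it to suit your analysis.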
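Because the bounds are a judgment call, it can help to keep them in one place instead of hard-coding them in the query string. The sketch below is one possible way to do that; `distance_filter` is a hypothetical helper name, not part of Spark or the course material, and it only builds the SQL expression string that is passed to `df.filter`.

```python
def distance_filter(lower: float = 0.0, upper: float = 200.0,
                    column: str = "trip_distance") -> str:
    """Build a SQL filter expression that keeps lower < column < upper.

    Hypothetical helper for illustration; the default bounds mirror the
    heuristic filter used above and are not an official limit.
    """
    if lower >= upper:
        raise ValueError("lower bound must be smaller than upper bound")
    return f"{column} > {lower} AND {column} < {upper}"


expr = distance_filter()
print(expr)

# With a Spark DataFrame `df` (assumed to be loaded already), the expression
# would be used like this:
#   df.filter(distance_filter()).selectExpr("max(trip_distance)").show()
```

Changing the cutoff then only requires a different argument, for example `distance_filter(upper=100)`.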

### Checklist

- [x] I have searched existing FAQs and this question is not already answered
- [x] The answer provides accurate, helpful information
- [x] I have included any relevant code examples or links
