Acceptance tests keep running for hours #1587

Closed
davidgamez opened this issue Sep 26, 2023 · 2 comments · Fixed by #1590
Comments

@davidgamez (Member)

Description

Acceptance tests kept running without stopping and had to be canceled.

Examples of long-running tests:

@jcpitre (Contributor) commented Sep 27, 2023

I tested with this dataset: https://storage.googleapis.com/storage/v1/b/mdb-latest/o/de-unknown-ulmer-eisenbahnfreunde-gtfs-1081.zip?alt=media (it's 900 MB with 40,000,000 shape rows).
I had to increase the heap to 8 GB; otherwise there was an out-of-memory error.
Even with 8 GB it ran seemingly forever.

It seems the problem started with PR #1553.

I locally removed the code from that PR and the test ran fast.

Possible solutions:

  • There might be ways to optimize the code.
  • We should establish limits above which we don't run a given validator (and issue a notice about it). Initially we should target the trip vs. shape distance validation; see the sketch after this list.
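
A rough sketch of such a guard, where MAX_SHAPE_ROWS and TooManyShapeRowsNotice are hypothetical names for a threshold and notice class that don't exist yet:

    // Hypothetical guard: skip the expensive validation on oversized feeds
    // and emit a notice instead of running for hours.
    private static final int MAX_SHAPE_ROWS = 1_000_000; // hypothetical limit

    @Override
    public void validate(NoticeContainer noticeContainer) {
      if (shapeTable.getEntities().size() > MAX_SHAPE_ROWS) {
        // TooManyShapeRowsNotice is a hypothetical notice type.
        noticeContainer.addValidationNotice(
            new TooManyShapeRowsNotice(shapeTable.getEntities().size()));
        return;
      }
      // ... run the trip vs. shape distance validation as usual ...
    }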

It would also be useful to add more logging; I don't think there is any DEBUG logging in the validator.
This would help pinpoint problems faster. For example, a DEBUG log before and after calling each validator would have told us what takes so long.
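
A minimal sketch of such a timing wrapper using Flogger's FluentLogger (any DEBUG-capable logger would do); the runValidator hook itself is hypothetical:

    // Hypothetical hook: log before/after each validator so slow ones
    // show up in DEBUG (FINE) output with their wall-clock time.
    private static final FluentLogger logger = FluentLogger.forEnclosingClass();

    private void runValidator(FileValidator validator, NoticeContainer notices) {
      String name = validator.getClass().getSimpleName();
      logger.atFine().log("Starting %s", name);
      long startNanos = System.nanoTime();
      validator.validate(notices);
      logger.atFine().log(
          "Finished %s in %d ms",
          name, TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos));
    }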

@bdferris-v2 (Collaborator)

Looking at #1553, there is definitely some inefficient code in there.

For example:

    // Scans every row of shapes.txt just to collect the distinct shape ids.
    List<String> uniqueShapeIds =
        shapeTable.getEntities().stream()
            .map(GtfsShape::shapeId)
            .distinct()
            .collect(Collectors.toList());

This is iterating over every line in shapes.txt, when the set of unique shape ids is already available via GtfsShapeTableContainer#byShapeIdMap().
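
A sketch of the cheaper lookup, assuming byShapeIdMap() returns a Guava multimap keyed by shape id, as described above:

    // Reuse the prebuilt index instead of re-scanning every shape row.
    Set<String> uniqueShapeIds = shapeTable.byShapeIdMap().keySet();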

Then, the worst offender:

    // For each shape id, re-scans the ENTIRE shapes table to find that
    // shape's points: O(shapes * shape_points) overall.
    uniqueShapeIds.forEach(
        shapeId -> {
          double maxShapeDist =
              shapeTable.getEntities().stream()
                  .filter(s -> s.shapeId().equals(shapeId))
                  .mapToDouble(GtfsShape::shapeDistTraveled)
                  .max()
                  .orElse(Double.NEGATIVE_INFINITY);
          // ... (rest of the per-shape validation)
        });

For each shape id, we are again looping over every entry in shapes.txt to find matching shape points; this is likely where the N×M blow-up and slowdown come from. Again, GtfsShapeTableContainer#byShapeIdMap() already has all the shape points grouped by shape id and should be used directly.
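
A sketch of the single-pass alternative, under the same assumption about byShapeIdMap():

    // One pass per shape: iterate the grouped map instead of re-filtering
    // the whole table for every shape id.
    shapeTable.byShapeIdMap().asMap().forEach(
        (shapeId, shapePoints) -> {
          double maxShapeDist =
              shapePoints.stream()
                  .mapToDouble(GtfsShape::shapeDistTraveled)
                  .max()
                  .orElse(Double.NEGATIVE_INFINITY);
          // ... compare against the trip's stop_times distances here ...
        });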

I'd also point out that the shape points and stop points below should both probably be filtered by hasShapeDistTraveled as well.
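
Sketch of that filter on the shape side (the stop_times side would presumably use the matching GtfsStopTime::hasShapeDistTraveled):

    // Ignore points that never set shape_dist_traveled, so the default 0.0
    // is not treated as a real measurement.
    double maxShapeDist =
        shapePoints.stream()
            .filter(GtfsShape::hasShapeDistTraveled)
            .mapToDouble(GtfsShape::shapeDistTraveled)
            .max()
            .orElse(Double.NEGATIVE_INFINITY);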

cka-y added a commit that referenced this issue Sep 29, 2023
cka-y added a commit that referenced this issue Oct 11, 2023
* fix: comments on #1587

* fix: removed unused variable

* fix: formatting

* fix: rollback sources for acceptance test

* fix: removing source causing timeout

* fix: removing source causing timeout