Commit

(converter) DirtreeSideloader now trims /index.html from the URL if present

This is a crawler artifact in 9 cases out of 10, and may lead to bad URLs.
vlofgren committed Sep 17, 2023
1 parent 9b385ec commit 98bcdf6
Showing 1 changed file with 6 additions and 0 deletions.
@@ -71,6 +71,12 @@ private ProcessedDocument process(Path path) {
        String body = Files.readString(path);
        String url = urlBase + dirBase.relativize(path);

        // We trim a trailing "index.html" from the URL (keeping the final "/"),
        // since this is typically an artifact from document retrieval
        if (url.endsWith("/index.html")) {
            url = url.substring(0, url.length() - "index.html".length());
        }

        return sideloaderProcessing
                .processDocument(url, body, extraKeywords, 10_000);
    }
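
The effect of the added check can be tried in isolation. The following is a minimal sketch and not the project's code: the class name IndexHtmlTrimSketch, the method name trimIndexHtml, and the example URLs are all hypothetical. It mirrors the condition and substring call from the diff, which remove only "index.html" and so leave the trailing slash in place.

```java
// Hypothetical standalone sketch of the suffix check shown in the diff.
// Only the endsWith/substring logic is taken from the commit; the class,
// method, and URLs below are assumptions made for illustration.
final class IndexHtmlTrimSketch {

    static String trimIndexHtml(String url) {
        // Match the full "/index.html" suffix so that file names such as
        // "myindex.html" are left untouched.
        if (url.endsWith("/index.html")) {
            // Cut only "index.html", which keeps the trailing "/".
            return url.substring(0, url.length() - "index.html".length());
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(trimIndexHtml("https://www.example.com/docs/index.html"));
        System.out.println(trimIndexHtml("https://www.example.com/docs/page.html"));
        System.out.println(trimIndexHtml("https://www.example.com/myindex.html"));
    }
}
```

Under these assumptions, the first URL becomes https://www.example.com/docs/ while the other two are returned unchanged, since neither ends with the full "/index.html" suffix.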