Commit
Tweaks to pub date heuristics to make it mostly get the 'historyofphilosophy.net' case right. Use HTML standard for plausibility checks in the more guesswork-like heuristics. Added more class names to look for date strings.
Showing 8 changed files with 70 additions and 19 deletions.
38 changes: 38 additions & 0 deletions
...features-convert/pubdate/src/main/java/nu/marginalia/pubdate/PubDateFromHtmlStandard.java
@@ -0,0 +1,38 @@
package nu.marginalia.pubdate;

import nu.marginalia.converting.model.HtmlStandard;

public class PubDateFromHtmlStandard {
    /** Used to bias pub date heuristics */
    public static int blindGuess(HtmlStandard standard) {
        return switch (standard) {
            case PLAIN -> 1993;
            case HTML123 -> 1997;
            case HTML4, XHTML -> 2006;
            case HTML5 -> 2018;
            case UNKNOWN -> 2000;
        };
    }

    /** Sanity check a publication year based on the HTML standard.
     * It is for example unlikely for an HTML5 document to be published
     * in 1998, since that is long before the HTML5 standard was published.
     * <p>
     * Discovering publication year involves a lot of guesswork; this helps
     * keep the guesses relatively sane.
     */
    public static boolean isGuessPlausible(HtmlStandard standard, int year) {
        switch (standard) {
            case HTML123:
                return year <= 2000;
            case XHTML:
            case HTML4:
                return year >= 2000;
            case HTML5:
                return year >= 2014;
            default:
                return true;
        }
    }

}
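The other changed files are not shown here, so how the heuristics call these two methods is not visible. As a rough sketch only, one way a guesswork-style heuristic might combine them is to keep a candidate year when `isGuessPlausible` accepts it and otherwise fall back to `blindGuess`. The `PubDateSketch` class, the `resolveYear` helper, and the local `HtmlStandard` enum (a stand-in for `nu.marginalia.converting.model.HtmlStandard`, with constants taken from the switch above) are assumptions made for illustration, not code from this commit.

```java
import java.util.OptionalInt;

// Hypothetical, self-contained sketch; not part of the commit.
class PubDateSketch {

    // Stand-in for nu.marginalia.converting.model.HtmlStandard.
    // The constants are the ones the switch statements in the diff refer to.
    enum HtmlStandard { PLAIN, HTML123, HTML4, XHTML, HTML5, UNKNOWN }

    // Same values as PubDateFromHtmlStandard.blindGuess in the diff.
    static int blindGuess(HtmlStandard standard) {
        return switch (standard) {
            case PLAIN -> 1993;
            case HTML123 -> 1997;
            case HTML4, XHTML -> 2006;
            case HTML5 -> 2018;
            case UNKNOWN -> 2000;
        };
    }

    // Same bounds as PubDateFromHtmlStandard.isGuessPlausible in the diff.
    static boolean isGuessPlausible(HtmlStandard standard, int year) {
        switch (standard) {
            case HTML123:
                return year <= 2000;
            case XHTML:
            case HTML4:
                return year >= 2000;
            case HTML5:
                return year >= 2014;
            default:
                return true;
        }
    }

    // Assumed helper: accept a heuristic's candidate year only when it is
    // plausible for the detected standard; otherwise use the blind guess.
    static int resolveYear(HtmlStandard standard, OptionalInt candidateYear) {
        if (candidateYear.isPresent()
                && isGuessPlausible(standard, candidateYear.getAsInt())) {
            return candidateYear.getAsInt();
        }
        return blindGuess(standard);
    }

    public static void main(String[] args) {
        // 1998 is rejected for HTML5, so the blind guess is used instead.
        System.out.println(resolveYear(HtmlStandard.HTML5, OptionalInt.of(1998)));
        // 2009 passes the HTML4 check and is kept.
        System.out.println(resolveYear(HtmlStandard.HTML4, OptionalInt.of(2009)));
        // No candidate at all: fall back to the blind guess for HTML123.
        System.out.println(resolveYear(HtmlStandard.HTML123, OptionalInt.empty()));
    }
}
```

Whether the real heuristics fall back to `blindGuess` or simply discard an implausible candidate and try the next heuristic cannot be determined from this file alone.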