Commit
popplerextractor: don't try to guess the title if there isn't one.
Summary:
Guessing the title of a document from metadata is not a winning strategy. The textbox with the biggest font may not be the title at all: it could be an author, or a barcode rendered with barcode image fonts [1].

The added heuristics are very finicky. One expects the title to contain a space (a title without one being "very unlikely", which may hold for most academic papers, but it is entirely possible for a document to just be called "Statement" or "Invoice"). Even worse, another rejects any title that contains the word "Microsoft" (to work around documents exported by Word). That would reject titles such as "Analysis of somethingsomething feature on Microsoft Windows", which is clearly a valid title.

Instead, don't try to be smart about the title extraction. Trust what the file says its title is, and if there is no title, live with it. This is simpler, faster and requires less code.

[1] For instance FiServ-generated credit card statements. See https://blog.flameeyes.eu/2017/09/how-i-leaked-my-own-credit-card-number/ for details.

Reviewers: #frameworks, aacid, mgallien, bruns
Reviewed By: mgallien, bruns
Subscribers: kde-frameworks-devel, #baloo, bruns, michaelh, anthonyfieroni, mgallien, vhanda, ngraham, #frameworks
Tags: #frameworks, #baloo
Differential Revision: https://phabricator.kde.org/D8007
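As an illustration only, the two rejection rules described in the summary and the simplified behaviour can be sketched as below. This is not the extractor's actual code: the function names oldHeuristicAccepts and titleFromMetadata are hypothetical, the checks are assumed to be plain case-sensitive substring tests, and the metadata title is modelled as a plain string that is empty when the file declares no title.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical model of the removed filters: a candidate title is accepted
// only if it contains a space and does not mention "Microsoft".
// (Assumption: simple case-sensitive substring checks.)
bool oldHeuristicAccepts(const std::string &candidate)
{
    const bool hasSpace = candidate.find(' ') != std::string::npos;
    const bool mentionsMicrosoft =
        candidate.find("Microsoft") != std::string::npos;
    return hasSpace && !mentionsMicrosoft;
}

// Hypothetical model of the simplified behaviour: report the title the file
// declares, and report nothing when the file declares none.
std::optional<std::string> titleFromMetadata(const std::string &metadataTitle)
{
    if (metadataTitle.empty()) {
        return std::nullopt;
    }
    return metadataTitle;
}

// Small demonstration of the cases named in the summary.
void demo()
{
    // Both of these are plausible titles, yet the old filters drop them.
    assert(!oldHeuristicAccepts("Invoice"));
    assert(!oldHeuristicAccepts(
        "Analysis of somethingsomething feature on Microsoft Windows"));

    // The simplified approach passes a declared title through unchanged.
    assert(titleFromMetadata("Invoice").has_value());
    assert(!titleFromMetadata("").has_value());
}
```

The point of the sketch is that both filters discard legitimate titles, while the simplified version has no case analysis left to get wrong.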