popplerextractor: don't try to guess the title if there isn't one.
Summary:
Guessing the title of a document from metadata is not a winning strategy. The textbox with the biggest font may not be the title at all: it could be an author name, or a barcode rendered with a barcode image font [1].

The added heuristics are very finicky. Requiring a space assumes that a one-word title is "very unlikely"; that would be the case for most academic papers, but it is very possible for a document to just be called "Statement" or "Invoice". Even worse, rejecting any title that contains the word "Microsoft" (to work around documents exported by Word) would also reject titles such as "Analysis of somethingsomething feature on Microsoft Windows", which is clearly a valid title.
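To make the failure modes concrete, here is a minimal sketch of the removed condition, rewritten with the standard library only so it can be tried without Qt or Poppler. The names `containsIgnoreCase` and `wasConsideredJunk` are hypothetical and do not exist in the code base; they only mirror the three checks shown in the diff.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Case-insensitive substring search, standing in for
// QString::contains(..., Qt::CaseInsensitive). ASCII only.
bool containsIgnoreCase(const std::string& haystack, const std::string& needle)
{
    auto it = std::search(haystack.begin(), haystack.end(),
                          needle.begin(), needle.end(),
                          [](unsigned char a, unsigned char b) {
                              return std::tolower(a) == std::tolower(b);
                          });
    return it != haystack.end();
}

// Hypothetical helper mirroring the removed condition: true means the
// metadata title would have been thrown away and replaced by a guess
// taken from the first page.
bool wasConsideredJunk(const std::string& title)
{
    return title.empty()
        || title.find(' ') == std::string::npos     // no space: single word
        || containsIgnoreCase(title, "Microsoft");  // assumed to be Word's default
}
```

Under this sketch, both "Statement" and "Analysis of somethingsomething feature on Microsoft Windows" are classified as junk, even though either could be a perfectly valid title.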

Instead, don't try to be smart about the title extraction. Trust what the file says its title is, and if there is no title, live with it. This is simpler, faster and requires less code.
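The behaviour that remains after this change can be sketched as follows, again with the standard library instead of Qt. Both `trimmed` (a stand-in for `QString::trimmed()` that handles ASCII whitespace only) and `titleFromMetadata` are hypothetical names used for illustration; the real code simply adds `Property::Title` when the trimmed metadata title is not empty.

```cpp
#include <optional>
#include <string>

// Stand-in for QString::trimmed(); strips ASCII whitespace at both ends.
std::string trimmed(const std::string& s)
{
    const char* whitespace = " \t\n\r\f\v";
    const auto first = s.find_first_not_of(whitespace);
    if (first == std::string::npos) {
        return {};  // empty or whitespace-only input
    }
    const auto last = s.find_last_not_of(whitespace);
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: report the metadata title as-is, or nothing at all.
// No fallback guess is attempted when the title is missing.
std::optional<std::string> titleFromMetadata(const std::string& metadataTitle)
{
    std::string title = trimmed(metadataTitle);
    if (title.empty()) {
        return std::nullopt;
    }
    return title;
}
```

A caller would add the title property only when the returned optional holds a value, which corresponds to the `if (!title.isEmpty())` check that the diff keeps.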

[1] For instance, FiServ-generated credit card statements. See https://blog.flameeyes.eu/2017/09/how-i-leaked-my-own-credit-card-number/ for details.

Reviewers: #frameworks, aacid, mgallien, bruns

Reviewed By: mgallien, bruns

Subscribers: kde-frameworks-devel, #baloo, bruns, michaelh, anthonyfieroni, mgallien, vhanda, ngraham, #frameworks

Tags: #frameworks, #baloo

Differential Revision: https://phabricator.kde.org/D8007
Flameeyes authored and Pointedstick committed May 10, 2018
1 parent 97618af commit 1491c4a
Showing 1 changed file with 0 additions and 88 deletions.
88 changes: 0 additions & 88 deletions src/extractors/popplerextractor.cpp
@@ -54,17 +54,6 @@ void PopplerExtractor::extract(ExtractionResult* result)

     QString title = pdfDoc->info(QStringLiteral("Title")).trimmed();

-    // The title extracted from the pdf metadata is in many cases not the real title
-    // of the document. Especially for research papers that are exported to pdf.
-    // As mostly the title of a pdf document is written on the first page in the biggest font
-    // we use this if the pdfDoc title is considered junk
-    if (title.isEmpty() ||
-        !title.contains(QLatin1Char(' ')) || // very unlikely the title of a document does only contain one word.
-        title.contains(QStringLiteral("Microsoft"), Qt::CaseInsensitive)) { // most research papers i found written with microsoft word
-        // have a garbage title of the pdf creator rather than the real document title
-        title = parseFirstPage(pdfDoc.data(), fileUrl);
-    }
-
     if (!title.isEmpty()) {
         result->add(Property::Title, title);
     }
@@ -103,80 +92,3 @@ void PopplerExtractor::extract(ExtractionResult* result)
         result->append(page->text(QRectF()));
     }
 }
-
-QString PopplerExtractor::parseFirstPage(Poppler::Document* pdfDoc, const QString& fileUrl)
-{
-    QScopedPointer<Poppler::Page> p(pdfDoc->page(0));
-
-    if (!p) {
-        qWarning() << "Could not read page content from" << fileUrl;
-        return QString();
-    }
-
-    QList<Poppler::TextBox*> tbList = p->textList();
-    QMap<int, QString> possibleTitleMap;
-
-    int currentLargestChar = 0;
-    int skipTextboxes = 0;
-
-    // Iterate over all textboxes. Each textbox can be a single character/word or textblock
-    // Here we combine the etxtboxes back together based on the textsize
-    // Important are the words with the biggest font size
-    foreach(Poppler::TextBox * tb, tbList) {
-
-        // if we added followup words, skip the textboxes here now
-        if (skipTextboxes > 0) {
-            skipTextboxes--;
-            continue;
-        }
-
-        int height = tb->charBoundingBox(0).height();
-
-        // if the following text is smaller than the biggest we found up to now, ignore it
-        if (height >= currentLargestChar) {
-            QString possibleTitle;
-            possibleTitle.append(tb->text());
-            currentLargestChar = height;
-
-            // if the text has follow up words add them to to create the full title
-            Poppler::TextBox* next = tb->nextWord();
-            while (next) {
-                possibleTitle.append(QLatin1Char(' '));
-                possibleTitle.append(next->text());
-                next = next->nextWord();
-                skipTextboxes++;
-            }
-
-            // now combine text for each font size together, very likeley it must be connected
-            QString existingTitlePart = possibleTitleMap.value(currentLargestChar, QString());
-            existingTitlePart.append(QLatin1Char(' '));
-            existingTitlePart.append(possibleTitle);
-            possibleTitleMap.insert(currentLargestChar, existingTitlePart);
-        }
-    }
-
-    qDeleteAll(tbList);
-
-    QList<int> titleSizes = possibleTitleMap.keys();
-    qSort(titleSizes.begin(), titleSizes.end(), qGreater<int>());
-
-    QString newPossibleTitle;
-
-    // find the text with the largest font that is not just 1 character
-    foreach(int i, titleSizes) {
-        QString title = possibleTitleMap.value(i);
-
-        // sometime the biggest part is a single letter
-        // as a starting paragraph letter
-        if (title.size() < 5) {
-            continue;
-        } else {
-            newPossibleTitle = title.trimmed();
-            break;
-        }
-    }
-
-    // Sometimes the titles that are extracted are too large. This is a way of trimming them.
-    newPossibleTitle.truncate(50);
-    return newPossibleTitle;
-}
