popplerextractor: don't try to guess the title if there isn't one.

Summary:
Guessing the title of a document from metadata is not a winning strategy. The text box with the biggest font may not be the title at all; it could be an author, or a barcode rendered with a barcode image font [1].

The added heuristics are very finicky. Expecting a space assumes that a one-word title is "very unlikely"; that holds for most academic papers, but a document can perfectly well be called "Statement" or "Invoice". Even worse, rejecting any title that contains the word "Microsoft" (to work around documents exported by Word) would also reject titles such as "Analysis of somethingsomething feature on Microsoft Windows", which is clearly a valid title.

Instead, don't try to be smart about the title extraction. Trust what the file says its title is, and if there is no title, live with it. This is simpler, faster and requires less code.

[1] For instance FiServ-generated credit card statements. See https://blog.flameeyes.eu/2017/09/how-i-leaked-my-own-credit-card-number/ for details.

Reviewers: #frameworks, aacid, mgallien, bruns
Reviewed By: mgallien, bruns
Subscribers: kde-frameworks-devel, #baloo, bruns, michaelh, anthonyfieroni, mgallien, vhanda, ngraham, #frameworks
Tags: #frameworks, #baloo
Differential Revision: https://phabricator.kde.org/D8007
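To make the false positives concrete, here is a minimal standalone sketch of the removed "junk title" check. It is an illustration only: it uses `std::string` instead of `QString`, and the names `containsIgnoreCase` and `oldCheckTreatsAsJunk` are hypothetical and do not appear in the extractor.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: ASCII case-insensitive substring search, standing in
// for QString::contains(..., Qt::CaseInsensitive) in the removed code.
bool containsIgnoreCase(const std::string& haystack, const std::string& needle)
{
    auto it = std::search(haystack.begin(), haystack.end(),
                          needle.begin(), needle.end(),
                          [](unsigned char a, unsigned char b) {
                              return std::tolower(a) == std::tolower(b);
                          });
    return it != haystack.end();
}

// Hypothetical stand-in for the removed condition: a title was treated as
// junk if it was empty, had no space, or mentioned "Microsoft".
bool oldCheckTreatsAsJunk(const std::string& title)
{
    return title.empty()
        || title.find(' ') == std::string::npos
        || containsIgnoreCase(title, "Microsoft");
}
```

Under these assumptions, both "Invoice" and "Analysis of somethingsomething feature on Microsoft Windows" are classified as junk, so the metadata title would have been discarded in favour of a first-page guess.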
KDE · May 10, 2018 · 1491c4a
1 parent 97618af
Showing 1 changed file with 0 additions and 88 deletions.
diff --git a/src/extractors/popplerextractor.cpp b/src/extractors/popplerextractor.cpp
@@ -54,17 +54,6 @@ void PopplerExtractor::extract(ExtractionResult* result)
 
     QString title = pdfDoc->info(QStringLiteral("Title")).trimmed();
 
-    // The title extracted from the pdf metadata is in many cases not the real title
-    // of the document. Especially for research papers that are exported to pdf.
-    // As mostly the title of a pdf document is written on the first page in the biggest font
-    // we use this if the pdfDoc title is considered junk
-    if (title.isEmpty() ||
-            !title.contains(QLatin1Char(' ')) ||                        // very unlikely the title of a document does only contain one word.
-            title.contains(QStringLiteral("Microsoft"), Qt::CaseInsensitive)) {  // most research papers i found written with microsoft word
-        // have a garbage title of the pdf creator rather than the real document title
-        title = parseFirstPage(pdfDoc.data(), fileUrl);
-    }
-
     if (!title.isEmpty()) {
         result->add(Property::Title, title);
     }
@@ -103,80 +92,3 @@ void PopplerExtractor::extract(ExtractionResult* result)
         result->append(page->text(QRectF()));
     }
 }
-
-QString PopplerExtractor::parseFirstPage(Poppler::Document* pdfDoc, const QString& fileUrl)
-{
-    QScopedPointer<Poppler::Page> p(pdfDoc->page(0));
-
-    if (!p) {
-        qWarning() << "Could not read page content from" << fileUrl;
-        return QString();
-    }
-
-    QList<Poppler::TextBox*> tbList = p->textList();
-    QMap<int, QString> possibleTitleMap;
-
-    int currentLargestChar = 0;
-    int skipTextboxes = 0;
-
-    // Iterate over all textboxes. Each textbox can be a single character/word or textblock
-    // Here we combine the etxtboxes back together based on the textsize
-    // Important are the words with the biggest font size
-    foreach(Poppler::TextBox * tb, tbList) {
-
-        // if we added followup words, skip the textboxes here now
-        if (skipTextboxes > 0) {
-            skipTextboxes--;
-            continue;
-        }
-
-        int height = tb->charBoundingBox(0).height();
-
-        // if the following text is smaller than the biggest we found up to now, ignore it
-        if (height >= currentLargestChar) {
-            QString possibleTitle;
-            possibleTitle.append(tb->text());
-            currentLargestChar = height;
-
-            // if the text has follow up words add them to to create the full title
-            Poppler::TextBox* next = tb->nextWord();
-            while (next) {
-                possibleTitle.append(QLatin1Char(' '));
-                possibleTitle.append(next->text());
-                next = next->nextWord();
-                skipTextboxes++;
-            }
-
-            // now combine text for each font size together, very likeley it must be connected
-            QString existingTitlePart = possibleTitleMap.value(currentLargestChar, QString());
-            existingTitlePart.append(QLatin1Char(' '));
-            existingTitlePart.append(possibleTitle);
-            possibleTitleMap.insert(currentLargestChar, existingTitlePart);
-        }
-    }
-
-    qDeleteAll(tbList);
-
-    QList<int> titleSizes = possibleTitleMap.keys();
-    qSort(titleSizes.begin(), titleSizes.end(), qGreater<int>());
-
-    QString newPossibleTitle;
-
-    // find the text with the largest font that is not just 1 character
-    foreach(int i, titleSizes) {
-        QString title = possibleTitleMap.value(i);
-
-        // sometime the biggest part is a single letter
-        // as a starting paragraph letter
-        if (title.size() < 5) {
-            continue;
-        } else {
-            newPossibleTitle = title.trimmed();
-            break;
-        }
-    }
-
-    // Sometimes the titles that are extracted are too large. This is a way of trimming them.
-    newPossibleTitle.truncate(50);
-    return newPossibleTitle;
-}
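
The barcode failure mode mentioned in the summary can be reproduced with a simplified, standalone sketch of the removed biggest-font heuristic. This is not the deleted function: `TextRun` and `guessTitleByFontSize` are hypothetical names, each run stands in for one Poppler text box with its character height, and the follow-up word handling is left out.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a text box: glyph height plus its text.
struct TextRun {
    int height;
    std::string text;
};

// Simplified version of the removed heuristic: keep only runs at least as
// tall as the tallest seen so far, group them by height, then return the
// tallest group with at least 5 characters, cut to 50 characters.
std::string guessTitleByFontSize(const std::vector<TextRun>& runs)
{
    std::map<int, std::string, std::greater<int>> byHeight;
    int largest = 0;
    for (const TextRun& run : runs) {
        if (run.height >= largest) {
            largest = run.height;
            std::string& part = byHeight[run.height];
            if (!part.empty()) {
                part += ' ';
            }
            part += run.text;
        }
    }
    // The map iterates from the largest height downwards.
    for (const auto& entry : byHeight) {
        if (entry.second.size() >= 5) {
            return entry.second.substr(0, 50);
        }
    }
    return std::string();
}
```

With the runs `{12, "Statement"}`, `{40, "0123456789"}`, `{10, "Account summary"}`, where the 40-unit run is the payload of a barcode font, this sketch returns the barcode digits rather than "Statement". That is the kind of result the commit avoids by trusting only the metadata title.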