PDFs not indexed through file content ingestion #4488

Closed
D-Groenewegen opened this issue Jan 29, 2020 · 11 comments


D-Groenewegen commented Jan 29, 2020

Hello from Wikibase Solutions!

This is a follow-up on a question recently raised on the SMW website.

The use case

  • MediaWiki 1.31.3
  • PHP 7.0.33 (apache2handler)
  • mySQL 5.7.28
  • ElasticSearch 5.6.16
  • SMW 3.1.1

We are in love with ElasticStore for SMW+Elasticsearch and its potential for integrating structured and unstructured searches.

We have lots of PDF files in the wiki and we want them to be findable and searchable using a combination of structured queries based on certain metadata and unstructured, full-text searches. These PDF files are enriched with additional metadata on pages other than the file page. It would be wonderful if we could run queries like:

[[-File.Date::in:2019]] [[-File.Type::Magazine]] [[in:propeller hartzell]]
which is meant to express:

  • (a) give me all files (1) that have a linked page through the 'File' property with Date in 2019 and with Type Magazine, and (2) that contain the words "propeller" or "hartzell"
  • (b) and return them in order of relevancy.

At present the files are nicely searchable with CirrusSearch, but that does not let us work with SMW's data and querying features.

The problem

We have followed the procedure for ElasticStore's file content ingestion as described here. We enabled the ingest-attachment plugin, and the extra properties ("File attachment") are now registered in the wiki.

We're not there yet, however. The new properties do not show the correct content type (they report "application/octet-stream") or content length (they report "0"), and full-text searches still return no results. It looks as if, out of the box, the ingest-attachment plugin is unable to handle documents like PDFs, and that something is needed to convert them to the base64-encoded form the plugin requires.
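One way we might narrow this down, independently of the wiki, is to feed one of the PDFs to the ingest-attachment processor directly through Elasticsearch's pipeline simulate API and see whether Tika extracts any text or content type at all. A rough sketch (the host, port and file path are placeholders for our setup):

<?php
// Sketch: run the ingest-attachment processor on a single PDF, bypassing
// SMW, to see what Tika extracts. Adjust host, port and file path.
$data = base64_encode( file_get_contents( '/path/to/sample.pdf' ) );

$payload = json_encode( [
	'pipeline' => [
		'processors' => [ [ 'attachment' => [ 'field' => 'data' ] ] ]
	],
	'docs' => [ [ '_source' => [ 'data' => $data ] ] ]
] );

$ch = curl_init( 'http://localhost:9200/_ingest/pipeline/_simulate' );
curl_setopt_array( $ch, [
	CURLOPT_RETURNTRANSFER => true,
	CURLOPT_POST           => true,
	CURLOPT_HTTPHEADER     => [ 'Content-Type: application/json' ],
	CURLOPT_POSTFIELDS     => $payload
] );

// On success the response contains "attachment.content",
// "attachment.content_type" and "attachment.content_length".
echo curl_exec( $ch ), "\n";
curl_close( $ch );

If the simulated document comes back with extracted text and a proper content type, the plugin and Tika are working and the problem would be on the SMW/replication side; if not, it points to Elasticsearch/Tika.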

Questions

(1) What would you suggest should be done? What approach do you recommend, or should become the default, for SMW/ES users? I would have thought that Tika takes care of the conversion. I cannot check at the moment whether Tika is installed and properly configured at all, but the documentation seems to assume that this is generally the case. Would you, for instance, recommend we use a third-party plugin like https://github.com/dadoonet/fscrawler (which in turn uses Tika)? FWIW, the way CirrusSearch indexes PDF files is through the PDFHandler extension. Not sure whether that is worth taking into consideration here.

(2) If this issue calls for further development work, is there any way we can help to solve it? We'd be happy to make a contribution!

(3) Please let me know whether there are any plans to further improve the integration between SMW and ES in one way or another.

Edit: Refs #3054 -kghbln


kghbln commented Jan 29, 2020

Thanks for asking. Just to make sure: was the file actually ingested? On the sandbox, for example, file ingestion works: a simple query like [[in:Human Development Index]] yields a result using this SMW-specific syntax.


D-Groenewegen commented Jan 29, 2020

Thanks for your answer and the useful links.

Oh, there's certainly more than one file [edit: 2718, most of them PDFs], and it does not look like file ingestion has happened at all. All that has happened is that the File attachment annotation appeared on the file page.

I guess we'll have to repeat the process and take a closer look at the logs. It could be that the smw.elasticFileIngest job got stuck at some point.


kghbln commented Jan 30, 2020

I figure something is causing pain with the ingestion, assuming that "experimental.file.ingest": true is set. Also have a look at the "rebuildElasticMissingDocuments.php" maintenance script.
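(The script ships with Semantic MediaWiki; assuming the default extension location, it would typically be run from the wiki's base directory as php extensions/SemanticMediaWiki/maintenance/rebuildElasticMissingDocuments.php.)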


D-Groenewegen commented Jan 30, 2020

It is, in this way:

$GLOBALS['smwgElasticsearchConfig']['indexer']['experimental.file.ingest'] = true;
(and of course, we can verify settings in Special:SemanticMediaWiki&action=settings)

Unfortunately, rebuildElasticMissingDocuments.php is unable to detect any missing pages/documents.

Are there any diagnostics we can perform on the smw.elasticFileIngest job, which I presume is separate from MW's job queue?
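In the meantime, a rough check I could run via maintenance/eval.php (a sketch, assuming the job is dispatched through MediaWiki's normal job queue):

// Sketch: report the backlog for the smw.elasticFileIngest job type,
// assuming it goes through MediaWiki's standard JobQueue.
$queue = JobQueueGroup::singleton()->get( 'smw.elasticFileIngest' );
print $queue->getSize() . " queued, " . $queue->getAbandonedCount() . " abandoned\n";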

P.S. I failed to mention above that the server runs on Ubuntu and that the wiki is part of a farm.


kghbln commented Jan 31, 2020

So far I have been lucky that it worked out of the box. I guess @mwjames will be able to help you much faster, and in a more sophisticated way, with how to debug this.


mwjames commented Feb 1, 2020

I'm fairly pressed for time, so I can only give a couple of notes.

PS: If you have specific questions that are deterministic and quickly observable, I might be able to spend a couple of minutes, but I cannot answer generalized questions.

As I have said repeatedly before, if you think the documentation is insufficient, you can ask specific questions or, even better, improve it yourself.

enriched with additional metadata on pages other than the file page. It would be wonderful if we could run queries like:

You can do that, for example (on the sandbox):

I just (2020-02-01) uploaded PMC6357155.pdf and PMC4369385.pdf; once the ingest job had completed, this lets you search for text contained within those PDFs.

(b) and return them in order of relevancy.

For details, see the documentation [2].

At present the files are nicely searchable with CirrusSearch, but that does not let us work with SMW's data and querying features.

I cannot confirm that statement; see above.

I guess @mwjames will be able to help you much faster, and in a more sophisticated way, with how to debug this.

Sorry, but I don't have much time for this, especially given that this is an experimental feature and users are asked to actively provide information to help make it a stable feature.

All I can say is that it works on the sandbox and on the wikis where I have been using the ElasticStore (ES 6.8.*) together with the ingest-attachment plugin, with PDFs ranging from 2 MB up to 120 MB, without encountering any of the mentioned issues.

If you think you have an issue with a particular file format or content, you can use the sandbox for demonstration purposes, but please keep in mind that Elasticsearch bundles a specific Tika version (see below).

Unfortunately, rebuildElasticMissingDocuments.php is unable to detect any missing pages/documents.

The wording here is incorrect: the script inspects entities (i.e. pages, documents), while file ingestion and the content it produces (available only in Elasticsearch) are not entities in the context of Semantic MediaWiki. The script only compares data known to exist in the wiki with what is expected to be stored in Elasticsearch. As outlined in [1], the File attachment container is the only yardstick within the wiki for identifying whether the ingestion was successful.

One could extend rebuildElasticMissingDocuments.php to take the File attachment property into account, but that is out of scope for this ticket.
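As a rough in-wiki check (a sketch that assumes the File attachment properties described in [1]), a query such as {{#ask: [[File attachment.Content type::+]] |?File attachment.Content type |?File attachment.Content length}} would list the files for which an ingested File attachment record exists; any uploaded file missing from that result has presumably not been ingested.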

I made a couple of changes to clarify how "File ingestion" is expected to work, so I urge everybody to re-read [0, 1] and the official Elasticsearch documentation.

The new properties do not show the correct content type (they report "application/octet-stream") or content length (they report "0"), and full-text searches still return no results. It looks as if, out of the box, the ingest-attachment plugin is unable to handle documents like PDFs

As noted in [1]: "The quality of the text indexed and the information provided by the File attachment properties depends solely on Elasticsearch and Tika (a specific Tika version is bundled with a specific Elasticsearch release).

Any issues with the quality of indexed content or the recognition of specific information about a file (e.g. type, date, author etc.) have to be addressed in Elasticsearch and are not part of the scope of Semantic MediaWiki."

Would you, for instance, recommend we use a third-party plugin like https://github.com/dadoonet/fscrawler (which in turn uses Tika)? FWIW, the way CirrusSearch indexes PDF files is through the PDFHandler extension. Not sure whether that is worth taking into consideration here.

We rely on Elasticsearch and the bundled Tika to do the job as outlined in the documentation, and there is no use case that would involve another extension or a process other than the one provided by Elasticsearch itself and the ingest-attachment plugin.

(1) What would you suggest should be done? What approach do you recommend, or should become the default, for SMW/ES users? I would have thought that Tika takes care of the conversion.

See my response above and [1].

(2) If this issue calls for further development work, is there any way we can help to solve it? We'd be happy to make a contribution!

Unless you can pinpoint an issue in the actual implementation, I haven't planned any further development work.

(3) Please let me know whether there are any plans to further improve the integration between SMW and ES in one way or another.

The question is unclear as to what is expected here and what such improved integration would look like; specific suggestions or concrete proposals are welcome so that we can evaluate whether they can be accomplished.

[0] https://www.semantic-mediawiki.org/wiki/Help:ElasticStore/File_ingestion
[1] https://github.com/SemanticMediaWiki/SemanticMediaWiki/blob/master/src/Elastic/docs/replication.md
[2] https://github.com/SemanticMediaWiki/SemanticMediaWiki/blob/master/src/Elastic/docs/search.md


kghbln commented Feb 1, 2020

Thanks for taking the time to provide more insight and to improve the existing documentation.

In my experience it was pretty easy to set up and use file ingestion (admittedly, I am not sure whether I have done so with ES 5), so I was a bit surprised over the past few days that it is not working for WBS.


mwjames commented Feb 9, 2020

@D-Groenewegen Even with the issue being closed: when I make an effort to explain things, I expect some form of acknowledgement from the other party; otherwise I get the feeling that spending my volunteer time answering questions or issues is a waste.


robis24 commented Jan 8, 2021

Almost a year later, I found the problem: we are using an S3 bucket for storing the files, and Tika was not able to access it. We had to disable the authentication on S3 and open the wiki for reading; after rebuilding the Elasticsearch index, the PDF content ingestion worked. @mwjames Thanks for the detailed explanations.

D-Groenewegen commented

Robis beat me to it! I had (unfortunately) very sound reasons for not being able to respond at the time, but as you can read, the issue hasn't been forgotten.


26 commented Jan 22, 2021

Hi, I am a colleague of @robis24 and @D-Groenewegen and also worked on this issue. The fix @robis24 described did not resolve an identical issue we had on another MediaWiki installation of ours.

After some research, it turned out that the call to get_headers() on this line returned false, which eventually caused the ingestion to fail.

Looking in the PHP logs revealed the reason the call to get_headers failed:

PHP Warning:  get_headers(): This function may only be used against URLs

As it turns out, the get_headers() function fails to recognize $wgServer, since it is a protocol-relative URL (e.g., //www.mediawiki.org). According to the $wgServer documentation, it has been allowed to be protocol-relative since MediaWiki 1.18.0.

A check should probably be added to Semantic MediaWiki that validates the URL and prepends a protocol if necessary, or $wgCanonicalServer should be used instead. Alternatively, a function other than get_headers() could be used.
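For illustration, a minimal sketch of that suggestion (wfExpandUrl() and PROTO_CANONICAL are standard MediaWiki helpers; $fileUrl stands in for whatever URL Semantic MediaWiki builds from $wgServer):

// Sketch: expand a possibly protocol-relative URL to an absolute one
// before handing it to get_headers(). $fileUrl is a placeholder.
$absoluteUrl = wfExpandUrl( $fileUrl, PROTO_CANONICAL );
$headers = get_headers( $absoluteUrl, 1 );

if ( $headers === false ) {
	// get_headers() only accepts absolute URLs, so a bare protocol-relative
	// $wgServer (e.g. "//www.mediawiki.org") would still fail here.
}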
