PDFs not indexed through file content ingestion #4488

Closed
D-Groenewegen opened this issue Jan 29, 2020 · 11 comments


D-Groenewegen commented Jan 29, 2020

Hello from Wikibase Solutions!

This is a follow-up on a question recently raised on the SMW website.

The use case

  • MediaWiki 1.31.3
  • PHP 7.0.33 (apache2handler)
  • mySQL 5.7.28
  • ElasticSearch 5.6.16
  • SMW 3.1.1

We are in love with ElasticStore for SMW+Elasticsearch and its potential for integrating structured and unstructured searches.

We have lots of PDF files in the wiki and we want them to be findable and searchable using a combination of structured queries based on certain metadata and unstructured, full-text searches. These PDF files are enriched with additional metadata on pages other than the file page. It would be wonderful if we could run queries like:

[[-File.Date::in:2019]] [[-File.Type::Magazine]] [[in:propeller hartzell]]
which is meant to express:

  • (a) give me all files (1) that have a linked page through the 'File' property with Date in 2019 and with Type Magazine, and (2) that contain the words "propeller" or "hartzell"
  • (b) and return them in order of relevancy.

At present the files are nicely searchable with CirrusSearch, but that does not let us work with SMW's data and querying features.

The problem

We have followed the procedure for ElasticStore's file content ingestion as described here. We enabled the ingest-attachment plugin, and the extra properties ("File attachment") are now registered in the wiki.

We're not there yet, however. The new properties do not show the correct content type (they report "application/octet-stream") or content length (they report "0"), and full-text searches still return no results. It looks as if, out of the box, the ingest-attachment plugin is unable to handle documents like PDFs, and that something is needed to convert them to the base64-encoded form the plugin requires.
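One way we might narrow this down, independently of the wiki, is to feed one of the PDFs to the ingest-attachment processor directly through Elasticsearch's pipeline simulate API and see whether Tika extracts any text or content type at all. A rough sketch (the host, port and file path are placeholders for our setup):

<?php
// Sketch: run the ingest-attachment processor on a single PDF, bypassing
// SMW, to see what Tika extracts. Adjust host, port and file path.
$data = base64_encode( file_get_contents( '/path/to/sample.pdf' ) );

$payload = json_encode( [
	'pipeline' => [
		'processors' => [ [ 'attachment' => [ 'field' => 'data' ] ] ]
	],
	'docs' => [ [ '_source' => [ 'data' => $data ] ] ]
] );

$ch = curl_init( 'http://localhost:9200/_ingest/pipeline/_simulate' );
curl_setopt_array( $ch, [
	CURLOPT_RETURNTRANSFER => true,
	CURLOPT_POST           => true,
	CURLOPT_HTTPHEADER     => [ 'Content-Type: application/json' ],
	CURLOPT_POSTFIELDS     => $payload
] );

// On success the response contains "attachment.content",
// "attachment.content_type" and "attachment.content_length".
echo curl_exec( $ch ), "\n";
curl_close( $ch );

If the simulated document comes back with extracted text and a proper content type, the plugin and Tika are working and the problem would be on the SMW/replication side; if not, it points to Elasticsearch/Tika.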

Questions

(1) What would you suggest should be done? What approach do you recommend, or should become the default, for SMW/ES users? I would have thought that Tika takes care of the conversion. I cannot check at the moment whether Tika is installed and properly configured at all, but the documentation seems to assume that this is generally the case. Would you, for instance, recommend we use a third-party plugin like https://github.com/dadoonet/fscrawler (which in turn uses Tika)? FWIW, the way CirrusSearch indexes PDF files is through the PDFHandler extension. Not sure whether that is worth taking into consideration here.

(2) If this issue calls for further development work, is there any way we can help to solve it? We'd be happy to make a contribution!

(3) Please let me know whether there are any plans to further improve the integration between SMW and ES in one way or another.

Edit: Refs #3054 -kghbln


kghbln commented Jan 29, 2020

Thanks for asking. Just to make sure: was the file actually ingested? On the sandbox, for example, file ingestion works: a simple query like [[in:Human Development Index]] yields a result using this SMW-specific syntax.


D-Groenewegen commented Jan 29, 2020

Thanks for your answer and the useful links.

Oh, there's certainly more than one file [edit: 2718, most of them PDFs], and it does not look like file ingestion has happened at all. All that has happened is that the File attachment annotation appeared on the file page.

I guess we'll have to repeat the process and take a closer look at the logs. It could be that the smw.elasticFileIngest job got stuck at some point.


kghbln commented Jan 30, 2020

I figure something is causing pain with the ingestion, assuming that "experimental.file.ingest": true is set. Also have a look at the "rebuildElasticMissingDocuments.php" maintenance script.
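(The script ships with Semantic MediaWiki; assuming the default extension location, it would typically be run from the wiki's base directory as php extensions/SemanticMediaWiki/maintenance/rebuildElasticMissingDocuments.php.)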


D-Groenewegen commented Jan 30, 2020

It is, in this way:

$GLOBALS['smwgElasticsearchConfig']['indexer']['experimental.file.ingest'] = true;
(and of course, we can verify settings in Special:SemanticMediaWiki&action=settings)

Unfortunately, rebuildElasticMissingDocuments.php is unable to detect any missing pages/documents.

Are there any diagnostics we can perform on the smw.elasticFileIngest job, which I presume is separate from MW's job queue?
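In the meantime, a rough check I could run via maintenance/eval.php (a sketch, assuming the job is dispatched through MediaWiki's normal job queue):

// Sketch: report the backlog for the smw.elasticFileIngest job type,
// assuming it goes through MediaWiki's standard JobQueue.
$queue = JobQueueGroup::singleton()->get( 'smw.elasticFileIngest' );
print $queue->getSize() . " queued, " . $queue->getAbandonedCount() . " abandoned\n";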

P.S. I failed to mention above that the server runs on Ubuntu and that the wiki is part of a farm.


kghbln commented Jan 31, 2020

So far I have been lucky that it worked out of the box. I guess @mwjames will be able to help you much faster, and in a more sophisticated way, with how to debug this.


mwjames commented Feb 1, 2020

I'm fairly pressed for time, so I can only give a couple of notes.

PS: If you have specific questions that are deterministic and quickly observable, I might be able to spend a couple of minutes, but I cannot answer generalized questions.

As I have said repeatedly before, if you think the documentation is insufficient, you can ask specific questions or, even better, improve it yourself.

enriched with additional metadata on pages other than the file page. It would be wonderful if we could run queries like:

You can do that, for example (on the sandbox):

I just (2020-02-01) uploaded PMC6357155.pdf and PMC4369385.pdf; once the ingest job had completed, this lets you search for text contained within those PDFs.

(b) and return them in order of relevancy.

For details, see the documentation [2].

At present the files are nicely searchable with CirrusSearch, but that does not let us work with SMW's data and querying features.

I cannot confirm that statement; see above.

I guess @mwjames will be able to help you much faster, and in a more sophisticated way, with how to debug this.

Sorry, but I don't have much time for this, especially given that this is an experimental feature and users are asked to actively provide information to help make it a stable feature.

All I can say is that it works on the sandbox and on the wikis where I have been using the ElasticStore (ES 6.8.*) together with the ingest-attachment plugin, with PDFs ranging from 2 MB up to 120 MB, without encountering any of the mentioned issues.

If you think you have an issue with a particular file format or content, you can use the sandbox for demonstration purposes, but please keep in mind that Elasticsearch bundles a specific Tika version (see below).

Unfortunately, rebuildElasticMissingDocuments.php is unable to detect any missing pages/documents.

The wording here is incorrect: the script inspects entities (i.e. pages, documents), while file ingestion and the content it produces (available only in Elasticsearch) are not entities in the context of Semantic MediaWiki. The script only compares data known to exist in the wiki with what is expected to be stored in Elasticsearch. As outlined in [1], the File attachment container is the only yardstick within the wiki for identifying whether the ingestion was successful.

One could extend rebuildElasticMissingDocuments.php to take the File attachment property into account, but that is out of scope for this ticket.
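As a rough in-wiki check (a sketch that assumes the File attachment properties described in [1]), a query such as {{#ask: [[File attachment.Content type::+]] |?File attachment.Content type |?File attachment.Content length}} would list the files for which an ingested File attachment record exists; any uploaded file missing from that result has presumably not been ingested.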

I made a couple of changes to clarify how "File ingestion" is expected to work, so I urge everybody to re-read [0, 1] and the official Elasticsearch documentation.

The new properties do not show the correct content type (they report "application/octet-stream") or content length (they report "0"), and full-text searches still return no results. It looks as if, out of the box, the ingest-attachment plugin is unable to handle documents like PDFs

As noted in [1]: "The quality of the text indexed and the information provided by the File attachment properties depends solely on Elasticsearch and Tika (a specific Tika version is bundled with a specific Elasticsearch release).

Any issues with the quality of indexed content or the recognition of specific information about a file (e.g. type, date, author etc.) have to be addressed in Elasticsearch and are not part of the scope of Semantic MediaWiki."

Would you, for instance, recommend we use a third-party plugin like https://github.com/dadoonet/fscrawler (which in turn uses Tika)? FWIW, the way CirrusSearch indexes PDF files is through the PDFHandler extension. Not sure whether that is worth taking into consideration here.

We rely on Elasticsearch and the bundled Tika to do the job as outlined in the documentation, and there is no use case that would involve another extension or a process other than the one provided by Elasticsearch itself and the ingest-attachment plugin.

(1) What would you suggest should be done? What approach do you recommend, or should become the default, for SMW/ES users? I would have thought that Tika takes care of the conversion.

See my response above and [1].

(2) If this issue calls for further development work, is there any way we can help to solve it? We'd be happy to make a contribution!

Unless you can pinpoint an issue in the actual implementation, I haven't planned any further development work.

(3) Please let me know whether there are any plans to further improve the integration between SMW and ES in one way or another.

The question is unclear as to what is expected here and what such improved integration would look like; specific suggestions or concrete proposals are welcome so that we can evaluate whether they can be accomplished.

[0] https://www.semantic-mediawiki.org/wiki/Help:ElasticStore/File_ingestion
[1] https://github.com/SemanticMediaWiki/SemanticMediaWiki/blob/master/src/Elastic/docs/replication.md
[2] https://github.com/SemanticMediaWiki/SemanticMediaWiki/blob/master/src/Elastic/docs/search.md


kghbln commented Feb 1, 2020

Thanks for taking the time to provide more insight and to improve the existing documentation.

In my experience it was pretty easy to set up and use file ingestion (admittedly, I am not sure whether I have done so with ES 5), so I was a bit surprised over the past few days that it is not working for WBS.


mwjames commented Feb 9, 2020

@D-Groenewegen Even with the issue being closed: when I make an effort to explain things, I expect some form of acknowledgement from the other party; otherwise I get the feeling that spending my volunteer time answering questions or issues is a waste.


robis24 commented Jan 8, 2021

Almost a year later, I found the problem: we are using an S3 bucket for storing the files, and Tika was not able to access it. We had to disable the authentication on S3 and open the wiki for reading; after rebuilding the Elasticsearch index, the PDF content ingestion worked. @mwjames Thanks for the detailed explanations.

D-Groenewegen commented

Robis beat me to it! I had (unfortunately) very sound reasons for not being able to respond at the time, but as you can read, the issue hasn't been forgotten.


26 commented Jan 22, 2021

Hi, I am a colleague of @robis24 and @D-Groenewegen and also worked on this issue. The fix @robis24 described did not resolve an identical issue we had on another MediaWiki installation of ours.

After some research, it turned out that the call to get_headers() on this line returned false, which eventually caused the ingestion to fail.

Looking in the PHP logs revealed the reason the call to get_headers failed:

PHP Warning:  get_headers(): This function may only be used against URLs

As it turns out, the get_headers() function fails to recognize $wgServer, since it is a protocol-relative URL (e.g., //www.mediawiki.org). According to the $wgServer documentation, it has been allowed to be protocol-relative since MediaWiki 1.18.0.

A check should probably be added to Semantic MediaWiki that validates the URL and prepends a protocol if necessary, or $wgCanonicalServer should be used instead. Alternatively, a function other than get_headers() could be used.
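For illustration, a minimal sketch of that suggestion (wfExpandUrl() and PROTO_CANONICAL are standard MediaWiki helpers; $fileUrl stands in for whatever URL Semantic MediaWiki builds from $wgServer):

// Sketch: expand a possibly protocol-relative URL to an absolute one
// before handing it to get_headers(). $fileUrl is a placeholder.
$absoluteUrl = wfExpandUrl( $fileUrl, PROTO_CANONICAL );
$headers = get_headers( $absoluteUrl, 1 );

if ( $headers === false ) {
	// get_headers() only accepts absolute URLs, so a bare protocol-relative
	// $wgServer (e.g. "//www.mediawiki.org") would still fail here.
}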
