PDFs not indexed through file content ingestion #4488
Thanks for asking. Just to make sure: the file was ingested, though? E.g. on the sandbox, file ingestion works with a simple query like:
Thanks for your answer and the useful links. Oh, there's certainly more than one file [edit: 2718, most of them PDFs], and it does not look like file ingestion has happened at all. All that's happened is that the File attachment annotation appeared on the file page. I guess we'll have to repeat the process and take a closer look at the logs. It could be that the …
I figure something is going wrong during ingestion, assuming that …
It is, in this way:
Unfortunately, … Are there any diagnostics we can perform on the ingest process?

P.S. I failed to mention above that the server runs on Ubuntu and that the wiki is part of a farm.
So far I was lucky that it worked out of the box. I guess @mwjames will be able to help you much faster, and in a more sophisticated way, with how to debug this.
I'm fairly pressed for time, therefore I can only give a couple of notes. PS: If you have specific questions that are deterministic and quickly observable, I might be able to spend a couple of minutes, but I cannot answer generalized questions. As I have repeatedly said before, if you think the documentation is insufficient, you can ask specific questions, or even better, improve it yourself.
You can do that, for example (on the sandbox):
I just (2020-02-01) uploaded PMC6357155.pdf and PMC4369385.pdf which, after the ingest job completed, lets you search for terms within the PDF:
For details, see the documentation [2].
I cannot confirm that statement, see above.
Sorry, but I don't have much time for this, especially given that this is an experimental feature and users are asked to actively provide information to make it a stable feature. All I can say is that it works on the sandbox and on those wikis where I have been using it. If you think you have an issue with a particular file format or content, you can use the sandbox for demonstration purposes, but please bear in mind that Elasticsearch bundles a specific Tika version (see below).
The wording here is incorrect: the script inspects entities (a.k.a. pages, documents), while file ingestion and the content produced by it (only available in Elasticsearch) is not an entity in the context of Semantic MediaWiki. The script only compares data known to exist in the wiki with data expected to be stored in Elasticsearch. As outlined in [1], the … One could extend … I made a couple of changes to clarify what "File ingestion" does and how it is expected to work, therefore I urge everybody to re-read [0, 1] and the official Elasticsearch documentation.
As noted in [1]: "The quality of the text indexed and the information provided by the File attachment properties depends solely on Elasticsearch and Tika (a specific Tika version is bundled with a specific Elasticsearch release). Any issues with the quality of indexed content or the recognition of specific information about a file (e.g. type, date, author etc.) have to be addressed in Elasticsearch and are not part of the scope of Semantic MediaWiki."
We rely on Elasticsearch and the bundled Tika to do the job as outlined in the documentation, and there is no use case that would involve another extension or a different process other than that provided by Elasticsearch itself and the …
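To make that concrete: the extraction path described here boils down to an Elasticsearch ingest pipeline using the attachment processor, which wraps the bundled Tika. Below is a minimal sketch of such a pipeline definition; the pipeline name `pdf-attachment` and the `indexed_chars` setting are illustrative assumptions, not what SMW's ElasticStore actually registers:

```python
import json

# Hypothetical pipeline name for illustration; ElasticStore manages
# its own pipeline naming internally.
PIPELINE_NAME = "pdf-attachment"

# The attachment processor hands the base64-encoded content in the
# "data" field to the bundled Tika and stores the extracted text
# under "attachment.content". indexed_chars = -1 removes the default
# extraction-length limit.
pipeline = {
    "description": "Extract text from base64-encoded files via Tika",
    "processors": [
        {"attachment": {"field": "data", "indexed_chars": -1}}
    ],
}

# This definition would be registered against Elasticsearch with:
#   PUT /_ingest/pipeline/pdf-attachment
body = json.dumps(pipeline)
print(body)
```

Any file indexed through such a pipeline gets its text extracted by whatever Tika version ships with that Elasticsearch release, which is why extraction quality is outside SMW's control.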
See my response above and [1].
Unless you can pin-point to an issue in the actual implementation, I haven't planned any further development work.
The question is unclear as to what is expected here and how such improved integration would look, therefore specific suggestions or concrete proposals are welcome so we can evaluate whether that can be accomplished or not.

[0] https://www.semantic-mediawiki.org/wiki/Help:ElasticStore/File_ingestion
Thanks for taking the time to provide more insight and to improve the existing documentation. In my experience it was pretty easy to set up and use file ingestion, though admittedly I'm not sure if I have done it with ES 5, so I was a bit surprised during the past days that it is not working for WBS.
@D-Groenewegen Even with the issue being closed, when I make an effort to explain some things I expect some form of acknowledgement from the other party, otherwise I get the feeling that spending my volunteer time on answering questions or issues is a waste of my time.
Almost a year later, I found the problem: we are using an S3 bucket for storing the files, and Tika was not able to access the S3 bucket. We had to disable authentication on S3 and open the wiki for reading. After rebuilding the Elasticsearch index, the PDF content ingestion worked. @mwjames Thanks for the detailed explanations
Robis beat me to it! I had (unfortunately) very sound reasons for not being able to respond at the time, but as you can read, the issue hasn't been lost in oblivion.
Hi, I am a colleague of @robis24 and @D-Groenewegen and also worked on this issue. The fix @robis24 described did not resolve an identical issue we had on another MediaWiki installation of ours. After some research, it turned out that the call to … was failing. Looking in the PHP logs revealed the reason the call to … failed:
As it turns out, the … A check should probably be added to Semantic MediaWiki that validates the URL and prepends a protocol if necessary, or …
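The suggested check is essentially scheme validation before the file is read by URL. A minimal sketch in Python (the actual fix would live in Semantic MediaWiki's PHP code; the function name and default scheme here are hypothetical):

```python
from urllib.parse import urlparse

def ensure_scheme(url: str, default: str = "https") -> str:
    """Prepend a protocol if the URL has none, so a URL-based file
    read does not fail on a scheme-less address."""
    if urlparse(url).scheme == "":
        return f"{default}://{url}"
    return url

# A scheme-less address gets the default protocol prepended;
# a URL that already carries a scheme is returned unchanged.
print(ensure_scheme("wiki.example.org/images/a/ab/File.pdf"))
print(ensure_scheme("http://wiki.example.org/images/a/ab/File.pdf"))
```

Whether to prepend a default or reject the URL outright is a design choice; rejecting loudly at least surfaces the misconfiguration in the logs instead of silently indexing nothing.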
Hello from Wikibase Solutions!
This is a follow-up on a question recently raised on the SMW website.
The use case
We are in love with ElasticStore for SMW+Elasticsearch and its potential for integrating structured and unstructured searches.
We have lots of PDF files in the wiki and we want them to be findable and searchable using a combination of structured queries based on certain metadata and unstructured, full-text searches. These PDF files are enriched with additional metadata on pages other than the file page. It would be wonderful if we could run queries like:
[[-File.Date::in:2019]] [[-File.Type::Magazine]] [[in:propeller hartzell]].
Meaning (or meaning to represent): all files dated in 2019, of type Magazine, whose content matches the full-text search "propeller hartzell".
Although the files are at present nicely searchable with CirrusSearch, that approach doesn't let us work with SMW's data and querying features.
The problem
We have followed the procedure for ElasticStore's file content ingestion as described here. We enabled the ingest-attachment plugin and some extra properties are registered in the wiki ("File attachment").
We're not there yet, however. The new properties don't show the correct content type (reporting "application/octet-stream") or content length (reporting "0"), and full-text searches still return no results. It looks like, out of the box, the ingest-attachment plugin is unable to handle documents like PDFs as-is, and that something is needed to convert them to the base64-encoded form it requires.
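For context, the ingest-attachment plugin does expect the raw file bytes to arrive base64-encoded in the source document; a correctly ingested PDF should therefore show up as a non-empty base64 string rather than zero-length octet-stream data. A minimal sketch of that encoding step (the function name is ours, and the sample bytes only mimic a PDF's magic number):

```python
import base64

def encode_for_ingest(file_bytes: bytes) -> str:
    """Base64-encode raw file bytes for the attachment processor's
    'data' field."""
    return base64.b64encode(file_bytes).decode("ascii")

# "%PDF-" is the magic number every PDF file starts with.
sample = b"%PDF-1.4 minimal example"
encoded = encode_for_ingest(sample)

# Round-tripping must reproduce the original bytes exactly; a
# reported content length of 0 suggests the bytes never arrived.
assert base64.b64decode(encoded) == sample
print(len(encoded))
```

If the encoded payload is empty or missing at index time (for example because the file could not be read from storage, as with the S3 case above), Elasticsearch has nothing for Tika to extract, which would match the symptoms described here.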
Questions
(1) What would you suggest should be done? What approach do you recommend, or what should become the default for SMW/ES users? I would have thought that Tika takes care of the conversion. I cannot check at the moment whether Tika is installed and properly configured at all, but the documentation seems to assume that this is generally the case. Would you, for instance, recommend we use a third-party tool like https://github.com/dadoonet/fscrawler (which in turn uses Tika)? FWIW, the way CirrusSearch indexes PDF files is through the PDFHandler extension. Not sure if that is worth taking into consideration here.
(2) If this issue calls for further development work, is there any way we can help to solve it? We'd be happy to make a contribution!
(3) Please let me know if there are any plans to improve integration between SMW and ES in one way or another.
Edit: Refs #3054 -kghbln