We should offer a way for users to search through our documents on DocumentCloud. They offer a really nice API of their own, but I think it'd probably be better for us to pull the raw text data into our backend and allow searching through it on our main search interface.
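As a rough illustration of the "pull the raw text into our backend and search it" idea, here is a minimal in-memory sketch. It assumes the text has already been fetched from DocumentCloud by some other step; `TextIndex`, `tokenize`, and the document IDs are hypothetical names for illustration, not part of the MuckRock or DocumentCloud code. A real deployment would more likely use the search backend the site already runs.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TextIndex:
    """Hypothetical in-memory inverted index: token -> set of document IDs."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        """Index the raw (e.g. OCRed) text of one document."""
        for token in tokenize(text):
            self._postings[token].add(doc_id)

    def search(self, query):
        """Return the IDs of documents containing every token in the query."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        # Copy the first posting set so intersections never mutate the index.
        result = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self._postings.get(token, set())
        return result
```

For example, after `index.add("doc-1", "FOIA request for police records")` and `index.add("doc-2", "Budget records for 2015")`, a call to `index.search("police records")` would match only `doc-1`, while `index.search("records")` would match both.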
I think so? Does it make sense to split it off into a separate database?
Not storing/using data for the sake of having less data seems like a bad
general policy if having it would be useful, but I get the concern about giant
DBs.
On Sun, Mar 13, 2016 at 10:33 PM mitchelljkotler notifications@github.com
wrote:
Concern: if we are moving raw emails to files due to size issues in the
DB, do we want to have all the OCR data in the database?
Now that we have merged with DocumentCloud, we should find a way to directly search our DocumentCloud documents from the main MuckRock search.
Our anonymous user directly requested the ability to search the OCRed text of the documents. He noticed that the direct PDF links are disallowed in robots.txt, so Google does not OCR and index them; we block them because we want people to visit the site for context instead of going to the PDFs directly. I'm not sure whether there is a better way to integrate this with Google, so that it indexes the OCRed text and people can find the request pages by searching for those terms.
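To make the robots.txt point concrete, the sketch below uses Python's standard `urllib.robotparser` to show how a crawler would treat a blocked PDF path versus an allowed request page. The rules, the `example.com` host, and the `/files/` and `/requests/` paths are assumptions made up for illustration; they are not copied from the site's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block direct file downloads, leave request pages crawlable.
ROBOTS_TXT = """\
User-agent: *
Disallow: /files/
"""


def build_parser(robots_txt):
    """Parse robots.txt content given as a string, without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_crawlable(parser, url, agent="Googlebot"):
    """Return True if the given user agent may fetch the URL under these rules."""
    return parser.can_fetch(agent, url)
```

Under these assumed rules, `is_crawlable(build_parser(ROBOTS_TXT), "https://example.com/files/doc.pdf")` is false while the same check on `https://example.com/requests/some-request/` is true, so any OCRed text rendered on the request page itself would remain visible to crawlers.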