Allow searching through DocumentCloud documents #5

morisy · 2014-10-16T05:17:35Z

We should offer a way for users to search through our documents on DocumentCloud. They offer a really nice API of their own, but I think it'd probably be better for us to pull the raw text data into our backend and allow searching through it on our main search interface.
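
To make the "pull the raw text into our backend" option concrete, here is a minimal sketch of searching extracted text locally. It assumes the document text has already been fetched and is available as plain strings; the `documents` dict, the document IDs, and the helper names are hypothetical, and a real implementation would likely use the search backend the site already has rather than an in-memory index.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each token to the set of document IDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents containing every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical sample data standing in for text pulled from DocumentCloud.
documents = {
    "doc-1": "Response to FOIA request regarding police body cameras",
    "doc-2": "Budget memo for the parks department",
}
index = build_index(documents)
matches = search(index, "body cameras")
```

Queries are matched as a simple AND over tokens, with no ranking or stemming, so this only illustrates the shape of the approach.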

mitchelljkotler · 2016-03-14T02:33:35Z

Concern: if we are moving raw emails to files due to size issues in the DB, do we want to have all the OCR data in the database?

morisy · 2016-03-14T02:39:45Z

I think so? Does it make sense to split it off into a separate database? Not storing/using data for the sake of having less data seems like a bad general policy if having it would be useful, but I get the concern of giant DBs.

mitchelljkotler · 2020-06-16T11:55:00Z

Now that we have merged with DocumentCloud, we should find a way to directly search our DocumentCloud documents from the main MuckRock search.

Our anonymous user directly requested the ability to search the OCRed text of the documents. He noticed that the direct PDFs are disallowed in robots.txt, so Google does not OCR and index them; we do this because we want people to visit the site for context instead of going to the PDFs directly. I'm not sure if there is a better way we could integrate this with Google, so that it indexes the OCRed text and searches for those terms can surface the request pages.
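
To illustrate the robots.txt behavior described above, here is a small sketch using Python's standard `urllib.robotparser`. The rules and URLs are made up for illustration (the actual robots.txt is not quoted in this thread); the point is only that a `Disallow` rule on the file path blocks crawlers from the PDFs while leaving the request pages crawlable.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block direct file downloads, allow everything else.
rules = """\
User-agent: *
Disallow: /files/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Hypothetical URLs for a direct PDF and for the request page that links to it.
pdf_allowed = parser.can_fetch(
    "Googlebot", "https://example.com/files/report.pdf"
)
page_allowed = parser.can_fetch(
    "Googlebot", "https://example.com/requests/example-request/"
)
```

Under rules like these a crawler never sees the PDF, so any OCRed text would need to appear on the request page itself (or another allowed page) to be indexed.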

morisy added the low label Oct 16, 2014

allanlasser added the nice to have label Dec 10, 2014

morisy assigned mitchelljkotler Dec 29, 2014

mitchelljkotler added the database label Apr 9, 2016

morisy added the documentcloud label Dec 19, 2017
