Allow searching through DocumentCloud documents #5

Open
morisy opened this issue Oct 16, 2014 · 3 comments
Comments

@morisy
Member

morisy commented Oct 16, 2014

We should offer a way for users to search through our documents on DocumentCloud. They offer a really nice API of their own, but I think it'd probably be better for us to pull the raw text data into our backend and allow searching through it on our main search interface.
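
A minimal sketch of the second option, assuming the OCR text has already been pulled from DocumentCloud and is available as plain strings. The names `tokenize`, `build_index`, and `search`, along with the sample documents and their IDs, are hypothetical and only illustrate the idea; a real deployment would more likely feed the text into whatever search backend the site already uses.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def build_index(docs):
    """Map each token to the set of document IDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the sorted IDs of documents containing every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        matches &= index.get(token, set())
    return sorted(matches)


# Hypothetical OCR text keyed by hypothetical document IDs.
docs = {
    "doc-1": "Records request regarding police body cameras",
    "doc-2": "Budget memo for the police department",
}
index = build_index(docs)
```

Calling `search(index, "police budget")` would return only the documents whose text contains both words, which is the behavior a main-site search box would need from the imported text.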

@mitchelljkotler
Member

Concern: if we are moving raw emails to files due to size issues in the DB, do we want to have all the OCR data in the database?

@morisy
Member Author

morisy commented Mar 14, 2016

I think so? Does it make sense to split it off into a separate database?
Not storing/using data for the sake of having less data seems like a bad
general policy if having it would be useful, but I get the concern about giant
DBs.
On Sun, Mar 13, 2016 at 10:33 PM mitchelljkotler notifications@github.com
wrote:

Concern: if we are moving raw emails to files due to size issues in the
DB, do we want to have all the OCR data in the database?



@mitchelljkotler
Member

Now that we have merged with DocumentCloud, we should find a way to directly search our DocumentCloud documents from the main MuckRock search.

Our anonymous user directly requested the ability to search the OCRed text of the documents. He noticed that the direct PDFs are disallowed in robots.txt, so Google does not OCR and index them; we want people to visit the site for context rather than land on the PDFs directly. I'm not sure whether there is a better way to integrate this with Google, so that it indexes the OCRed text and people can find the request pages by searching for those terms.
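
To make the robots.txt situation concrete, here is a small sketch using Python's standard `urllib.robotparser`. The rules, paths, and `example.com` URLs are hypothetical stand-ins, not the site's actual robots.txt; the `crawlable` helper is likewise just for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block direct document downloads, leave other pages open.
ROBOTS_TXT = """\
User-agent: *
Disallow: /documents/
"""


def crawlable(robots_txt, url, agent="Googlebot"):
    """Return True if the given robots.txt rules let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical URLs for a direct PDF and for the request page that links to it.
pdf_url = "https://example.com/documents/report.pdf"
page_url = "https://example.com/requests/report-request/"

pdf_allowed = crawlable(ROBOTS_TXT, pdf_url)
page_allowed = crawlable(ROBOTS_TXT, page_url)
```

Under rules like these, a crawler that honors robots.txt can fetch the request page but not the PDF, so the OCRed text would presumably have to be exposed on the request page itself for those terms to be indexed.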
