possibility of a com.norconex.collector.http.data.store.impl.dynamodb ? #448

Open
obxpete opened this issue Dec 28, 2017 · 3 comments

Comments


obxpete commented Dec 28, 2017

A DynamoDB driver, along the lines of the MongoDB one, would be great. If there are no plans or bandwidth to build one, please post any guidance here that would help a novice Java developer get started.

@essiembre
Contributor

Are you talking about the URL crawl store? If so, there are no plans for a DynamoDB implementation, but we can make this a feature request and get to it if there is enough demand.

A crawl store is a cache of what has already been crawled (e.g., to help detect modifications and deletions) and is not meant to store all content + metadata. For this, you need a Committer.
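
To illustrate the cache idea, here is a minimal Java sketch of how remembering a checksum per URL lets a crawler tell new, modified, unmodified, and deleted documents apart between runs. The class and method names (`CrawlCacheSketch`, `record`, `finishRun`) are hypothetical and are not the Norconex crawl store API; a real store would persist the data rather than keep it in memory.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory crawl cache; NOT the Norconex API. */
final class CrawlCacheSketch {

    public enum State { NEW, MODIFIED, UNMODIFIED }

    // URL -> checksum recorded during the previous crawl run.
    private final Map<String, String> previous = new HashMap<>();
    // URL -> checksum recorded so far during the current crawl run.
    private final Map<String, String> current = new HashMap<>();

    /** Records a crawled URL and classifies it against the previous run. */
    public State record(String url, String checksum) {
        current.put(url, checksum);
        String old = previous.get(url);
        if (old == null) {
            return State.NEW;
        }
        return old.equals(checksum) ? State.UNMODIFIED : State.MODIFIED;
    }

    /**
     * Ends the current run. Returns the URLs that were seen in the
     * previous run but not in this one (candidates for deletion), then
     * makes the current run the baseline for the next one.
     */
    public Set<String> finishRun() {
        Set<String> deleted = new HashSet<>(previous.keySet());
        deleted.removeAll(current.keySet());
        previous.clear();
        previous.putAll(current);
        current.clear();
        return deleted;
    }
}
```

Note that only the URL and a checksum are kept, not the content itself, which is why the content and metadata go to a Committer instead.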

If you want to create a new crawl store, look at how the current ones are implemented, such as MongoDB here and there.

Committers should be simpler to implement and they usually are what you want. Have a look here to get started. Again, you may want to check how existing ones were done.

Does this help?

If you end up creating your own, let us know!

Author

obxpete commented Jan 5, 2018 via email


danizen commented Mar 30, 2018

+1. DynamoDB is easy to set up in AWS, and a crawler like Norconex makes a calculable number of requests, which fits DynamoDB's provisioned-capacity model. A DynamoDB crawl store plus an S3StatusStore would mean that Norconex Collector could run on an AWS spot instance once a week, and people would save a lot.
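
As a rough illustration of "calculable", here is a small Java sketch that estimates the write capacity to provision for one crawl run. It relies on the DynamoDB rule that one write capacity unit covers one standard write per second for an item of up to 1 KB. The class name and all input figures (URL count, writes per URL, item size, crawl duration) are assumptions for the example, not measurements from Norconex.

```java
/** Hypothetical capacity estimate; all input figures are assumptions. */
final class DynamoCapacitySketch {

    /**
     * Estimates the write capacity units (WCU) needed for one crawl run,
     * assuming writes are spread evenly over the run.
     *
     * @param urls         number of URLs the crawl will touch
     * @param writesPerUrl store writes per URL (e.g. queued + processed)
     * @param itemSizeKb   size of one stored record in KB
     * @param crawlSeconds expected duration of the run in seconds
     */
    static long writeCapacityUnits(long urls, int writesPerUrl,
                                   double itemSizeKb, long crawlSeconds) {
        // Each write consumes one unit per started KB, with a minimum of one.
        long unitsPerWrite = Math.max(1L, (long) Math.ceil(itemSizeKb));
        double writesPerSecond = (double) urls * writesPerUrl / crawlSeconds;
        return (long) Math.ceil(writesPerSecond * unitsPerWrite);
    }

    public static void main(String[] args) {
        // Assumed: 1,000,000 URLs, 2 writes each, 0.5 KB records, 6-hour run.
        System.out.println(writeCapacityUnits(1_000_000L, 2, 0.5, 6 * 3600L));
    }
}
```

With those assumed inputs the estimate is 93 WCU. Real traffic is bursty rather than evenly spread, so treat the figure as a starting point rather than a sizing guarantee.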

With the current system, the best case would be MVStore on a persistent EBS volume that you reload to recrawl the following week, but you would still need some way to get the status off the box, e.g. a periodic S3 sync.

A DynamoDB crawl store plus an S3StatusStore would be a nearly complete solution, without having to resort to DevOps-like scripting to get it done (like another user's mkfifo approach).
